Background and Overview

DataCamp offers interactive courses related to R programming. While some of the material is review, it is helpful to see other perspectives on it. As well, DataCamp has some interesting materials on packages that I want to learn better (ggplot2, dplyr, ggvis, etc.). This document summarizes a few key insights from:

This document is currently split between _v003 and _v003_a due to the need to keep the number of DLLs that it opens below the hard-coded maximum. This introductory section needs to be re-written, and the contents consolidated, at a future date.

The original DataCamp_Insights_v001 and DataCamp_Insights_v002 documents were split to produce this document:

Working with Dates and Times in R

Chapter 1 - Dates and Times in R

Introduction to dates - including the built-in methods for R:

  • Differences in M-D-Y and D-M-Y
  • ISO8601 is a standard for dates - components are ordered from largest to smallest, such as YYYY-MM-DD
    • The numbers should all be padded with leading zeroes
    • A separator is not required, but it must be a dash (-) if used
  • R will generally require input using as.Date()
  • Some functions that read in data will automatically recognize and parse dates in a variety of formats
    • In particular the import functions, like read_csv(), in the readr package will recognize dates in a few common formats
    • There is also the anytime() function in the anytime package whose sole goal is to automatically parse strings as dates regardless of the format

Why use dates?

  • Behind the scenes, dates are stored as the number of days since 1970-01-01
    • Can compare dates, take differences of dates, use dates for plotting, and the like
  • R releases have a major, minor, and patch
    • Patch starts at zero with a new minor and increments by 1
    • Minor starts at zero with a new major and increments by 1
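
Not from the course code, but a quick sketch of the day-based storage described above (base R behavior):

```r
# Dates are stored as a count of days since the epoch, 1970-01-01
x <- as.Date("1970-01-03")
unclass(x)    # 2 - two days after the epoch

# Day-based storage makes comparison and subtraction natural
as.Date("2013-04-03") > as.Date("1970-01-01")   # TRUE
as.Date("2013-04-03") - as.Date("1970-01-01")   # Time difference of 15798 days
```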

What about times?

  • R also has the built-in capability to handle datetimes
  • ISO8601 has standards for datetimes also - YYYY-MM-DDTHH:MM:SS, with a T separating the date and time
  • Two capabilities for storing times in R
    • POSIXlt - list with named components
    • POSIXct - seconds since 1970-01-01 00:00:00 (typically better for data frames, and focus of this module)
  • Can convert to POSIXct using as.POSIXct()
  • Can pass a timezone, and the default assumption is local time
    • If the string is passed as YYYYMMDDTHH:MM:SSZ then the assumption is made of Zulu (UTC) time
  • One drawback is that as.POSIXct() does not automatically recognize timezones in the input string, so some additional work is required to properly enter a datetime

Why lubridate?

  • The lubridate package is designed to make it easier to work with dates and times
    • Part of the tidyverse - designed for humans, and integrates nicely to data analysis pipelines
    • Consistent behavior regardless of the underlying objects
  • Easier to use, and more forgiving of formats
  • Has capability for time spans (time between two times, such as time for reign of monarchs)

Example code includes:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)


# The date R 3.0.0 was released
x <- "2013-04-03"

# Examine structure of x
str(x)
##  chr "2013-04-03"
# Use as.Date() to interpret x as a date
x_date <- as.Date(x)

# Examine structure of x_date
str(x_date)
##  Date[1:1], format: "2013-04-03"
# Store April 10 2014 as a Date
april_10_2014 <- as.Date("2014-04-10")


# Load the readr package
library(readr)

# Use read_csv() to import rversions.csv
releases <- read_csv("./RInputFiles/rversions.csv")
## Parsed with column specification:
## cols(
##   major = col_integer(),
##   minor = col_integer(),
##   patch = col_integer(),
##   date = col_date(format = ""),
##   datetime = col_datetime(format = ""),
##   time = col_time(format = ""),
##   type = col_character()
## )
# Examine the structure of the date column
str(releases$date)
##  Date[1:105], format: "1997-12-04" "1997-12-21" "1998-01-10" "1998-03-14" ...
# Load the anytime package
library(anytime)

# Various ways of writing Sep 10 2009
sep_10_2009 <- c("September 10 2009", "2009-09-10", "10 Sep 2009", "09-10-2009")

# Use anytime() to parse sep_10_2009
anytime(sep_10_2009)
## [1] "2009-09-10 CDT" "2009-09-10 CDT" "2009-09-10 CDT" "2009-09-10 CDT"
# Set the x axis to the date column
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major)))

# Limit the axis to between 2010-01-01 and 2014-01-01
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major))) +
  xlim(as.Date("2010-01-01"), as.Date("2014-01-01"))
## Warning: Removed 87 rows containing missing values (geom_path).

# Specify breaks every ten years and labels with "%Y"
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major))) +
  scale_x_date(date_breaks = "10 years", date_labels = "%Y")

# Find the largest date
last_release_date <- max(releases$date)

# Filter row for last release
last_release <- filter(releases, date == last_release_date)

# Print last_release
last_release
## # A tibble: 1 x 7
##   major minor patch date       datetime            time   type 
##   <int> <int> <int> <date>     <dttm>              <time> <chr>
## 1     3     4     1 2017-06-30 2017-06-30 07:04:11 07:04  patch
# How long since last release?
Sys.Date() - last_release_date
## Time difference of 261 days
# Use as.POSIXct to enter the datetime 
as.POSIXct("2010-10-01 12:12:00")
## [1] "2010-10-01 12:12:00 CDT"
# Use as.POSIXct again but set the timezone to `"America/Los_Angeles"`
as.POSIXct("2010-10-01 12:12:00", tz = "America/Los_Angeles")
## [1] "2010-10-01 12:12:00 PDT"
# Use read_csv to import rversions.csv
releases <- read_csv("./RInputFiles/rversions.csv")
## Parsed with column specification:
## cols(
##   major = col_integer(),
##   minor = col_integer(),
##   patch = col_integer(),
##   date = col_date(format = ""),
##   datetime = col_datetime(format = ""),
##   time = col_time(format = ""),
##   type = col_character()
## )
# Examine structure of datetime column
str(releases$datetime)
##  POSIXct[1:105], format: "1997-12-04 08:47:58" "1997-12-21 13:09:22" ...
# Import "cran-logs_2015-04-17.csv" with read_csv()
logs <- read_csv("./RInputFiles/cran-logs_2015-04-17.csv")
## Parsed with column specification:
## cols(
##   datetime = col_datetime(format = ""),
##   r_version = col_character(),
##   country = col_character()
## )
# Print logs
logs
## # A tibble: 100,000 x 3
##    datetime            r_version country
##    <dttm>              <chr>     <chr>  
##  1 2015-04-16 22:40:19 3.1.3     CO     
##  2 2015-04-16 09:11:04 3.1.3     GB     
##  3 2015-04-16 17:12:37 3.1.3     DE     
##  4 2015-04-18 12:34:43 3.2.0     GB     
##  5 2015-04-16 04:49:18 3.1.3     PE     
##  6 2015-04-16 06:40:44 3.1.3     TW     
##  7 2015-04-16 00:21:36 3.1.3     US     
##  8 2015-04-16 10:27:23 3.1.3     US     
##  9 2015-04-16 01:59:43 3.1.3     SG     
## 10 2015-04-18 15:41:32 3.2.0     CA     
## # ... with 99,990 more rows
# Store the release time as a POSIXct object
release_time <- as.POSIXct("2015-04-16 07:13:33", tz = "UTC")

# When is the first download of 3.2.0?
logs %>% 
  filter(r_version == "3.2.0")
## # A tibble: 35,928 x 3
##    datetime            r_version country
##    <dttm>              <chr>     <chr>  
##  1 2015-04-18 12:34:43 3.2.0     GB     
##  2 2015-04-18 15:41:32 3.2.0     CA     
##  3 2015-04-18 14:58:41 3.2.0     IE     
##  4 2015-04-18 16:44:45 3.2.0     US     
##  5 2015-04-18 04:34:35 3.2.0     US     
##  6 2015-04-18 22:29:45 3.2.0     CH     
##  7 2015-04-17 16:21:06 3.2.0     US     
##  8 2015-04-18 20:34:57 3.2.0     AT     
##  9 2015-04-17 18:23:19 3.2.0     US     
## 10 2015-04-18 03:00:31 3.2.0     US     
## # ... with 35,918 more rows
# Examine histograms of downloads by version
ggplot(logs, aes(x = datetime)) +
  geom_histogram() +
  geom_vline(aes(xintercept = as.numeric(release_time)))+
  facet_wrap(~ r_version, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Chapter 2 - Parsing and Manipulating Dates with lubridate

Parsing dates with lubridate:

  • lubridate::ymd() will manage dates in year-month-day order, even if they are not strictly ISO formatted (different separators, English month names, ordinal suffixes, and the like)
    • Analogous behavior from ydm(), mdy(), myd(), dmy(), and dym()
    • Assumes UTC unless otherwise specified
    • All the functions with y, m and d in any order exist
    • If your dates have times as well, you can use the functions that start with ymd, dmy, mdy or ydm and are followed by any of _h, _hm or _hms
    • To see all the functions available look at ymd() for dates and ymd_hms() for datetimes
  • lubridate::parse_date_time(x=, orders=)
    • The orders = argument is a sequence of characters, reflecting the order in the input
    • Y-year with century, y-year without century, m-month, d-day, H-hours (24-hour), M-minutes, S-seconds, and many others
    • a-abbreviated weekday, A-full weekday, b-abbreviated month, B-full month, I-hours (12-hour), p-AM/PM, z-timezone offset from UTC
    • Can pass a vector of sequences to orders=, such as orders = c("ymd", "dmy"), if some of the dates are formatted differently than others

Weather in Auckland (data from Weather Underground, METAR from Auckland airport):

  • Data are available in akl_weather_daily.csv and akl_weather_hourly_2016.csv
  • The lubridate::make_date(year, month, day) will produce a date from its components (these components can be vectors, such as columns in a data frame)
    • There is also a lubridate::make_datetime(year, month, day, hour, min, sec)

Extracting parts of a datetime:

  • The lubridate::year() will pull out the year from a datetime object
    • month(), day(), hour(), minute(), second() will do the same
    • wday() gives the day of the week (1-7, starting with Sunday), yday() gives the day of the year (1-366), and tz() gives the timezone
  • The extractors can also be used to set a component of the datetime object
  • Several functions answer “is this a” questions
    • leap_year(), am(), pm(), and dst() return booleans (TRUE/FALSE)
    • quarter() returns 1-4 and semester() returns 1-2 - classifications rather than booleans
  • Months of course are different lengths, so we should really correct for that - take a look at days_in_month() for help with that
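
Not from the course code, but a minimal sketch of these extractors and predicates (standard lubridate behavior):

```r
library(lubridate)

x <- ymd_hms("2016-02-29 09:30:00")

leap_year(x)   # TRUE - 2016 is a leap year
am(x)          # TRUE - 09:30 is before noon
quarter(x)     # 1 - February falls in the first quarter
semester(x)    # 1 - February falls in the first semester

# Correct for month length, e.g. fraction of the month elapsed
day(x) / days_in_month(x)   # 29/29 = 1

# The extractors can also set a component
month(x) <- 3
x   # "2016-03-29 09:30:00 UTC"
```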

Rounding datetimes:

  • The lubridate::floor_date(unit=) will round down to the requested unit, such as “hour”
    • round_date() for nearest
    • ceiling_date() for round-up
  • Units can be specified as “second”, “minute”, “hour”, “day”, “week”, “month”, “bimonth”, “quarter”, “halfyear”, “year”

Example code includes:

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(readr)
library(dplyr)
library(ggplot2)
library(ggridges)
library(stringr)


# Parse x 
x <- "2010 September 20th" # 2010-09-20
ymd(x)
## [1] "2010-09-20"
# Parse y 
y <- "02.01.2010"  # 2010-01-02
dmy(y)
## [1] "2010-01-02"
# Parse z 
z <- "Sep, 12th 2010 14:00"  # 2010-09-12T14:00
mdy_hm(z)
## [1] "2010-09-12 14:00:00 UTC"
# Specify an order string to parse x
x <- "Monday June 1st 2010 at 4pm"
parse_date_time(x, orders = "AmdyIp")
## [1] "2010-06-01 16:00:00 UTC"
# Specify order to include both "mdy" and "dmy"
two_orders <- c("October 7, 2001", "October 13, 2002", "April 13, 2003", 
  "17 April 2005", "23 April 2017")
parse_date_time(two_orders, orders = c("mdy", "dmy"))
## [1] "2001-10-07 UTC" "2002-10-13 UTC" "2003-04-13 UTC" "2005-04-17 UTC"
## [5] "2017-04-23 UTC"
# Specify order to include "dOmY", "OmY" and "Y"
short_dates <- c("11 December 1282", "May 1372", "1253")
parse_date_time(short_dates, orders = c("dOmY", "OmY", "Y"))
## [1] "1282-12-11 UTC" "1372-05-01 UTC" "1253-01-01 UTC"
# Import CSV with read_csv()
akl_daily_raw <- read_csv("./RInputFiles/akl_weather_daily.csv")
## Parsed with column specification:
## cols(
##   date = col_character(),
##   max_temp = col_integer(),
##   min_temp = col_integer(),
##   mean_temp = col_integer(),
##   mean_rh = col_integer(),
##   events = col_character(),
##   cloud_cover = col_integer()
## )
# Print akl_daily_raw
akl_daily_raw
## # A tibble: 3,661 x 7
##    date      max_temp min_temp mean_temp mean_rh events cloud_cover
##    <chr>        <int>    <int>     <int>   <int> <chr>        <int>
##  1 2007-9-1        60       51        56      75 <NA>             4
##  2 2007-9-2        60       53        56      82 Rain             4
##  3 2007-9-3        57       51        54      78 <NA>             6
##  4 2007-9-4        64       50        57      80 Rain             6
##  5 2007-9-5        53       48        50      90 Rain             7
##  6 2007-9-6        57       42        50      69 <NA>             1
##  7 2007-9-7        59       41        50      77 <NA>             4
##  8 2007-9-8        59       46        52      80 <NA>             5
##  9 2007-9-9        55       50        52      88 Rain             7
## 10 2007-9-10       59       50        54      82 Rain             4
## # ... with 3,651 more rows
# Parse date 
akl_daily <- akl_daily_raw %>%
  mutate(date = ymd(date))

# Print akl_daily
akl_daily
## # A tibble: 3,661 x 7
##    date       max_temp min_temp mean_temp mean_rh events cloud_cover
##    <date>        <int>    <int>     <int>   <int> <chr>        <int>
##  1 2007-09-01       60       51        56      75 <NA>             4
##  2 2007-09-02       60       53        56      82 Rain             4
##  3 2007-09-03       57       51        54      78 <NA>             6
##  4 2007-09-04       64       50        57      80 Rain             6
##  5 2007-09-05       53       48        50      90 Rain             7
##  6 2007-09-06       57       42        50      69 <NA>             1
##  7 2007-09-07       59       41        50      77 <NA>             4
##  8 2007-09-08       59       46        52      80 <NA>             5
##  9 2007-09-09       55       50        52      88 Rain             7
## 10 2007-09-10       59       50        54      82 Rain             4
## # ... with 3,651 more rows
# Plot to check work
ggplot(akl_daily, aes(x = date, y = max_temp)) +
  geom_line() 
## Warning: Removed 1 rows containing missing values (geom_path).

# Import "akl_weather_hourly_2016.csv"
akl_hourly_raw <- read_csv("./RInputFiles/akl_weather_hourly_2016.csv")
## Parsed with column specification:
## cols(
##   year = col_integer(),
##   month = col_integer(),
##   mday = col_integer(),
##   time = col_time(format = ""),
##   temperature = col_double(),
##   weather = col_character(),
##   conditions = col_character(),
##   events = col_character(),
##   humidity = col_integer(),
##   date_utc = col_datetime(format = "")
## )
# Print akl_hourly_raw
akl_hourly_raw
## # A tibble: 17,454 x 10
##     year month  mday time   temperature weather conditions events humidity
##    <int> <int> <int> <time>       <dbl> <chr>   <chr>      <chr>     <int>
##  1  2016     1     1 00:00         68.0 Clear   Clear      <NA>         68
##  2  2016     1     1 00:30         68.0 Clear   Clear      <NA>         68
##  3  2016     1     1 01:00         68.0 Clear   Clear      <NA>         73
##  4  2016     1     1 01:30         68.0 Clear   Clear      <NA>         68
##  5  2016     1     1 02:00         68.0 Clear   Clear      <NA>         68
##  6  2016     1     1 02:30         68.0 Clear   Clear      <NA>         68
##  7  2016     1     1 03:00         68.0 Clear   Clear      <NA>         68
##  8  2016     1     1 03:30         68.0 Cloudy  Partly Cl~ <NA>         68
##  9  2016     1     1 04:00         68.0 Cloudy  Scattered~ <NA>         68
## 10  2016     1     1 04:30         66.2 Cloudy  Partly Cl~ <NA>         73
## # ... with 17,444 more rows, and 1 more variable: date_utc <dttm>
# Use make_date() to combine year, month and mday 
akl_hourly  <- akl_hourly_raw  %>% 
  mutate(date = make_date(year = year, month = month, day = mday))

# Parse datetime_string 
akl_hourly <- akl_hourly  %>% 
  mutate(
    datetime_string = paste(date, time, sep = "T"),
    datetime = ymd_hms(datetime_string)
  )

# Print date, time and datetime columns of akl_hourly
akl_hourly %>% select(date, time, datetime)
## # A tibble: 17,454 x 3
##    date       time   datetime           
##    <date>     <time> <dttm>             
##  1 2016-01-01 00:00  2016-01-01 00:00:00
##  2 2016-01-01 00:30  2016-01-01 00:30:00
##  3 2016-01-01 01:00  2016-01-01 01:00:00
##  4 2016-01-01 01:30  2016-01-01 01:30:00
##  5 2016-01-01 02:00  2016-01-01 02:00:00
##  6 2016-01-01 02:30  2016-01-01 02:30:00
##  7 2016-01-01 03:00  2016-01-01 03:00:00
##  8 2016-01-01 03:30  2016-01-01 03:30:00
##  9 2016-01-01 04:00  2016-01-01 04:00:00
## 10 2016-01-01 04:30  2016-01-01 04:30:00
## # ... with 17,444 more rows
# Plot to check work
ggplot(akl_hourly, aes(x = datetime, y = temperature)) +
  geom_line()

# Examine the head() of release_time
releases <- read_csv("./RInputFiles/rversions.csv")
## Parsed with column specification:
## cols(
##   major = col_integer(),
##   minor = col_integer(),
##   patch = col_integer(),
##   date = col_date(format = ""),
##   datetime = col_datetime(format = ""),
##   time = col_time(format = ""),
##   type = col_character()
## )
release_time <- releases %>% pull(datetime)
head(release_time)
## [1] "1997-12-04 08:47:58 UTC" "1997-12-21 13:09:22 UTC"
## [3] "1998-01-10 00:31:55 UTC" "1998-03-14 19:25:55 UTC"
## [5] "1998-05-02 07:58:17 UTC" "1998-06-14 12:56:20 UTC"
# Examine the head() of the months of release_time
head(month(release_time))
## [1] 12 12  1  3  5  6
# Extract the month of releases 
month(release_time) %>% table()
## .
##  1  2  3  4  5  6  7  8  9 10 11 12 
##  5  6  8 18  5 16  4  7  2 15  6 13
# Extract the year of releases
year(release_time) %>% table()
## .
## 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 
##    2   10    9    6    6    5    5    4    4    4    4    6    5    4    6 
## 2012 2013 2014 2015 2016 2017 
##    4    4    4    5    5    3
# How often is the hour before 12 (noon)?
mean(hour(release_time) < 12)
## [1] 0.752381
# How often is the release in am?
mean(am(release_time))
## [1] 0.752381
# Use wday() to tabulate release by day of the week
wday(releases$datetime) %>% table()
## .
##  1  2  3  4  5  6  7 
##  3 29  9 12 18 31  3
# Add label = TRUE to make table more readable
wday(releases$datetime, label=TRUE) %>% table()
## .
## Sun Mon Tue Wed Thu Fri Sat 
##   3  29   9  12  18  31   3
# Create column wday to hold labelled week days
releases$wday <- wday(releases$datetime, label=TRUE)

# Plot barchart of weekday by type of release
ggplot(releases, aes(x=wday)) +
  geom_bar() +
  facet_wrap(~ type, ncol = 1, scales = "free_y")

# Add columns for year, yday and month
akl_daily <- akl_daily %>%
  mutate(
    year = year(date),
    yday = yday(date),
    month = month(date, label=TRUE))

# Plot max_temp by yday for all years
ggplot(akl_daily, aes(x = yday, y = max_temp)) +
  geom_line(aes(group = year), alpha = 0.5)
## Warning: Removed 1 rows containing missing values (geom_path).

# Examine distribtion of max_temp by month
ggplot(akl_daily, aes(x = max_temp, y = month, height = ..density..)) +
  geom_density_ridges(stat = "density")
## Warning: Removed 10 rows containing non-finite values (stat_density).

# Create new columns hour, month and rainy
akl_hourly <- akl_hourly %>%
  mutate(
    hour = hour(datetime),
    month = month(datetime, label=TRUE),
    rainy = (weather == "Precipitation")
  )

# Filter for hours between 8am and 10pm (inclusive)
akl_day <- akl_hourly %>% 
  filter(hour >= 8, hour <= 22)

# Summarise for each date if there is any rain
rainy_days <- akl_day %>% 
  group_by(month, date) %>%
  summarise(
    any_rain = any(rainy)
  )

# Summarise for each month, the number of days with rain
rainy_days %>% 
  summarise(
    days_rainy = sum(any_rain)
  )
## # A tibble: 12 x 2
##    month days_rainy
##    <ord>      <int>
##  1 Jan           15
##  2 Feb           13
##  3 Mar           12
##  4 Apr           15
##  5 May           21
##  6 Jun           19
##  7 Jul           22
##  8 Aug           16
##  9 Sep           25
## 10 Oct           20
## 11 Nov           19
## 12 Dec           11
r_3_4_1 <- ymd_hms("2016-05-03 07:13:28 UTC")

# Round down to day
floor_date(r_3_4_1, unit = "day")
## [1] "2016-05-03 UTC"
# Round to nearest 5 minutes
round_date(r_3_4_1, unit = "5 minutes")
## [1] "2016-05-03 07:15:00 UTC"
# Round up to week 
ceiling_date(r_3_4_1, unit = "week")
## [1] "2016-05-08 UTC"
# Subtract r_3_4_1 rounded down to day
r_3_4_1 - floor_date(r_3_4_1, unit = "day")
## Time difference of 7.224444 hours
# Create day_hour, datetime rounded down to hour
akl_hourly <- akl_hourly %>%
  mutate(
    day_hour = floor_date(datetime, unit = "hour")
  )

# Count observations per hour  
akl_hourly %>% 
  count(day_hour) 
## # A tibble: 8,770 x 2
##    day_hour                n
##    <dttm>              <int>
##  1 2016-01-01 00:00:00     2
##  2 2016-01-01 01:00:00     2
##  3 2016-01-01 02:00:00     2
##  4 2016-01-01 03:00:00     2
##  5 2016-01-01 04:00:00     2
##  6 2016-01-01 05:00:00     2
##  7 2016-01-01 06:00:00     2
##  8 2016-01-01 07:00:00     2
##  9 2016-01-01 08:00:00     2
## 10 2016-01-01 09:00:00     2
## # ... with 8,760 more rows
# Find day_hours with n != 2  
akl_hourly %>% 
  count(day_hour) %>%
  filter(n != 2) %>% 
  arrange(desc(n))
## # A tibble: 92 x 2
##    day_hour                n
##    <dttm>              <int>
##  1 2016-04-03 02:00:00     4
##  2 2016-09-25 00:00:00     4
##  3 2016-06-26 09:00:00     1
##  4 2016-09-01 23:00:00     1
##  5 2016-09-02 01:00:00     1
##  6 2016-09-04 11:00:00     1
##  7 2016-09-04 16:00:00     1
##  8 2016-09-04 17:00:00     1
##  9 2016-09-05 00:00:00     1
## 10 2016-09-05 15:00:00     1
## # ... with 82 more rows

Chapter 3 - Arithmetic with Dates and Times

Taking differences of datetimes:

  • Pure subtraction will give the days between two datetimes, reported on the command line as “Time difference of x days”
    • The difftime(day1, day2, units=) function is the same as day1 - day2, but with additional control of being able to request units (secs, mins, hours, days, weeks)
  • The today() function gives you today’s date as a Date object
  • The now() function gives you the current date-time as a POSIXct object

Time spans - difficult because they do not have a constant meaning (e.g., impact of daylight savings time):

  • The lubridate package manages time spans as EITHER period or duration
    • The period is the way a human thinks about it - 1 day means same exact hour-minute-second tomorrow
    • The duration is the way a stopwatch thinks about it - 1 day means 24 hours from now
  • The period time span in lubridate is called by adding an “s” to the end of the relevant function
    • For example, days(x=1) will be exactly +1 in the days category only (all other units untouched)
  • The duration in lubridate is called by adding a “d” to the front of the relevant period function
    • For example, ddays(x=1) will add 24 hours to the datetime
  • There was an eclipse over North America on 2017-08-21 at 18:26:40
    • It’s possible to predict the next eclipse with similar geometry by calculating the time and date one Saros in the future
    • A Saros is a length of time that corresponds to 223 Synodic months, a Synodic month being the period of the Moon’s phases, a duration of 29 days, 12 hours, 44 minutes and 3 seconds
  • What should ymd("2018-01-31") + months(1) return? Should it be 30, 31 or 28 days in the future? Try it
    • In general lubridate returns the same day of the month in the next month, but since the 31st of February doesn’t exist lubridate returns a missing value, NA
    • There are alternative addition and subtraction operators: %m+% and %m-% that have different behavior
    • Rather than returning an NA for a non-existent date, they roll back to the last existing date
    • But use these operators with caution, unlike + and -, you might not get x back from x %m+% months(1) %m-% months(1)
    • If you’d prefer that the date was rolled forward check out add_with_rollback() which has roll_to_first argument
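
Not from the course code, but a short sketch of the rollback behavior described above (add_with_rollback() and its roll_to_first argument are as documented in lubridate):

```r
library(lubridate)

jan_31 <- ymd("2018-01-31")

jan_31 + months(1)      # NA - "2018-02-31" does not exist
jan_31 %m+% months(1)   # "2018-02-28" - rolled back to the last existing date

# Caution: not necessarily invertible
jan_31 %m+% months(1) %m-% months(1)   # "2018-01-28", not "2018-01-31"

# Roll forward to the first of the next month instead
add_with_rollback(jan_31, months(1), roll_to_first = TRUE)   # "2018-03-01"
```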

Intervals - third option in lubridate for storing times:

  • Can find length, whether an object is in the interval, whether various intervals overlap, and the like
    • Intervals can be created either by using interval(datetime1, datetime2) or datetime1 %--% datetime2
  • There are many lubridate functions for working with intervals
    • int_start() and int_end() will give back the start and end date for the interval
    • int_length() will give back the interval length in seconds
    • as.period() will return the interval length as a period, while as.duration() will return the interval length as a duration
    • aDateTime %within% anInterval will return a boolean - TRUE if the datetime falls inside the interval
    • The int_overlaps(int1, int2) will return a boolean for whether there is any overlap
  • Intervals tend to be best when you have a specific start and end date
    • Otherwise, use periods for human purposes and durations for technical purposes
  • The operator %within% tests if the datetime (or interval) on the left hand side is within the interval of the right hand side
    • int_overlaps() performs a similar test, but will return true if two intervals overlap at all

Example code includes:

# The date of landing and moment of step
date_landing <- mdy("July 20, 1969")
moment_step <- mdy_hms("July 20, 1969, 02:56:15", tz = "UTC")

# How many days since the first man on the moon?
difftime(today(), date_landing, units = "days")
## Time difference of 17773 days
# How many seconds since the first man on the moon?
difftime(now(), moment_step, units = "secs")
## Time difference of 1535625267 secs
# Three dates
mar_11 <- ymd_hms("2017-03-11 12:00:00", 
  tz = "America/Los_Angeles")
mar_12 <- ymd_hms("2017-03-12 12:00:00", 
  tz = "America/Los_Angeles")
mar_13 <- ymd_hms("2017-03-13 12:00:00", 
  tz = "America/Los_Angeles")

# Difference between mar_13 and mar_12 in seconds
difftime(mar_13, mar_12, units = "secs")
## Time difference of 86400 secs
# Difference between mar_12 and mar_11 in seconds
difftime(mar_12, mar_11, units = "secs")
## Time difference of 82800 secs
# Add a period of one week to mon_2pm
mon_2pm <- dmy_hm("27 Aug 2018 14:00")
mon_2pm + weeks(1)
## [1] "2018-09-03 14:00:00 UTC"
# Add a duration of 81 hours to tue_9am
tue_9am <- dmy_hm("28 Aug 2018 9:00")
tue_9am + dhours(81)
## [1] "2018-08-31 18:00:00 UTC"
# Subtract a period of five years from today()
today() - years(5)
## [1] "2013-03-18"
# Subtract a duration of five years from today()
today() - dyears(5)
## [1] "2013-03-19"
# Time of North American Eclipse 2017
eclipse_2017 <- ymd_hms("2017-08-21 18:26:40")

# Duration of 29 days, 12 hours, 44 mins and 3 secs
synodic <- ddays(29) + dhours(12) + dminutes(44) + dseconds(3)

# 223 synodic months
saros <- 223 * synodic

# Add saros to eclipse_2017
eclipse_2017 + saros
## [1] "2035-09-02 02:09:49 UTC"
# Add a period of 8 hours to today
today_8am <- today() + hours(8)

# Sequence of two weeks from 1 to 26
every_two_weeks <- 1:26 * weeks(2)

# Create datetime for every two weeks for a year
today_8am + every_two_weeks
##  [1] "2018-04-01 08:00:00 UTC" "2018-04-15 08:00:00 UTC"
##  [3] "2018-04-29 08:00:00 UTC" "2018-05-13 08:00:00 UTC"
##  [5] "2018-05-27 08:00:00 UTC" "2018-06-10 08:00:00 UTC"
##  [7] "2018-06-24 08:00:00 UTC" "2018-07-08 08:00:00 UTC"
##  [9] "2018-07-22 08:00:00 UTC" "2018-08-05 08:00:00 UTC"
## [11] "2018-08-19 08:00:00 UTC" "2018-09-02 08:00:00 UTC"
## [13] "2018-09-16 08:00:00 UTC" "2018-09-30 08:00:00 UTC"
## [15] "2018-10-14 08:00:00 UTC" "2018-10-28 08:00:00 UTC"
## [17] "2018-11-11 08:00:00 UTC" "2018-11-25 08:00:00 UTC"
## [19] "2018-12-09 08:00:00 UTC" "2018-12-23 08:00:00 UTC"
## [21] "2019-01-06 08:00:00 UTC" "2019-01-20 08:00:00 UTC"
## [23] "2019-02-03 08:00:00 UTC" "2019-02-17 08:00:00 UTC"
## [25] "2019-03-03 08:00:00 UTC" "2019-03-17 08:00:00 UTC"
jan_31 <- ymd("2018-01-31")
# A sequence of 1 to 12 periods of 1 month
month_seq <- 1:12 * months(1)

# Add 1 to 12 months to jan_31
jan_31 + month_seq
##  [1] NA           "2018-03-31" NA           "2018-05-31" NA          
##  [6] "2018-07-31" "2018-08-31" NA           "2018-10-31" NA          
## [11] "2018-12-31" "2019-01-31"
# Replace + with %m+%
jan_31 %m+% month_seq
##  [1] "2018-02-28" "2018-03-31" "2018-04-30" "2018-05-31" "2018-06-30"
##  [6] "2018-07-31" "2018-08-31" "2018-09-30" "2018-10-31" "2018-11-30"
## [11] "2018-12-31" "2019-01-31"
# Replace + with %m-%
jan_31 %m-% month_seq
##  [1] "2017-12-31" "2017-11-30" "2017-10-31" "2017-09-30" "2017-08-31"
##  [6] "2017-07-31" "2017-06-30" "2017-05-31" "2017-04-30" "2017-03-31"
## [11] "2017-02-28" "2017-01-31"
# Create monarchs
mNames <- c('Elizabeth II' ,'Victoria' ,'George V' ,'George III' ,'George VI' ,'George IV' ,'Edward VII' ,'William IV' ,'Edward VIII' ,'George III(also United Kingdom)' ,'George II' ,'George I' ,'Anne' ,'Henry III' ,'Edward III' ,'Elizabeth I' ,'Henry VI' ,'Henry VI' ,'Æthelred II' ,'Æthelred II' ,'Henry VIII' ,'Charles II' ,'Henry I' ,'Henry II(co-ruler with Henry the Young King)' ,'Edward I' ,'Alfred the Great' ,'Edward the Elder' ,'Charles I' ,'Henry VII' ,'Edward the Confessor' ,'Richard II' ,'James I' ,'Edward IV' ,'Edward IV' ,'William I' ,'Edward II' ,'Cnut' ,'Stephen' ,'Stephen' ,'John' ,'Edgar I' ,'Æthelstan' ,'Henry IV' ,'William III(co-ruler with Mary II)' ,'Henry the Young King(co-ruler with Henry II)' ,'William II' ,'Richard I' ,'Eadred' ,'Henry V' ,'Edmund I' ,'Edward VI' ,'Mary II(co-ruler with William III)' ,'Mary I' ,'Anne(also Kingdom of Great Britain)' ,'Eadwig' ,'James II' ,'Edward the Martyr' ,'Harold I' ,'Harthacnut' ,'Richard III' ,'Louis (disputed)' ,'Harold II' ,'Edmund II' ,'Matilda (disputed)' ,'Edward V' ,'Edgar II' ,'Sweyn Forkbeard' ,'Jane (disputed)' ,'James VI' ,'William I' ,'Constantine II' ,'David II' ,'Alexander III' ,'Malcolm III' ,'Alexander II' ,'James I' ,'Malcolm II' ,'James V' ,'David I' ,'James III' ,'Charles II' ,'Charles II' ,'James IV' ,'Mary I' ,'Charles I' ,'Kenneth II' ,'James II' ,'Robert I' ,'Robert II' ,'Alexander I' ,'Macbeth' ,'Robert III' ,'Constantine I' ,'Kenneth MacAlpin' ,'William II' ,'Malcolm IV' ,'Giric(co-ruler with Eochaid?)' ,'Donald II' ,'Malcolm I' ,'Edgar' ,'Kenneth III' ,'Indulf' ,'Duncan I' ,'Mary II' ,'Amlaíb' ,'Anne(also Kingdom of Great Britain)' ,'Dub' ,'Cuilén' ,'Domnall mac Ailpín' ,'James VII' ,'Margaret' ,'John Balliol' ,'Donald III' ,'Constantine III' ,'Áed mac Cináeda' ,'Lulach' ,'Duncan II' ,'Ruaidrí Ua Conchobair' ,'Edward Bruce (disputed)' ,'Brian Ua Néill (disputed)' ,'Gruffudd ap Cynan' ,'Llywelyn the Great' ,'Owain Gwynedd' ,'Dafydd ab Owain Gwynedd' ,'Hywel ab Owain Gwynedd' 
,'Llywelyn ap Gruffudd' ,'Owain Glyndŵr (disputed)' ,'Owain Goch ap Gruffydd' ,'Owain Lawgoch (disputed)' ,'Dafydd ap Llywelyn' ,'Dafydd ap Gruffydd')
mDominion <- c('United Kingdom' ,'United Kingdom' ,'United Kingdom' ,'United Kingdom' ,'United Kingdom' ,'United Kingdom' ,'United Kingdom' ,'United Kingdom' ,'United Kingdom' ,'Great Britain' ,'Great Britain' ,'Great Britain' ,'Great Britain' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'England' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Scotland' ,'Ireland' ,'Ireland' ,'Ireland' ,'Gwynedd' ,'Gwynedd' ,'Gwynedd' ,'Gwynedd' ,'Gwynedd' ,'Wales' ,'Wales' ,'Wales' ,'Wales' ,'Wales' ,'Wales')
mFrom <- c('1952-02-06' ,'1837-06-20' ,'1910-05-06' ,'1801-01-01' ,'1936-12-11' ,'1820-01-29' ,'1901-01-22' ,'1830-06-26' ,'1936-01-20' ,'1760-10-25' ,'1727-06-22' ,'1714-08-01' ,'1707-05-01' ,'NA' ,'1327-01-25' ,'1558-11-17' ,'1422-08-31' ,'1470-10-31' ,'978-03-18' ,'1014-02-03' ,'1509-04-22' ,'1649-01-30' ,'1100-08-03' ,'1154-10-25' ,'1272-11-20' ,'871-04-24' ,'899-10-27' ,'1625-03-27' ,'1485-08-22' ,'1042-06-08' ,'1377-06-22' ,'1603-03-24' ,'1461-03-04' ,'1471-04-11' ,'1066-12-12' ,'1307-07-07' ,'1016-11-30' ,'1135-12-22' ,'1141-11-01' ,'1199-04-06' ,'959-10-01' ,'924-08-02' ,'1399-09-29' ,'1689-02-13' ,'1170-06-14' ,'1087-09-09' ,'1189-07-06' ,'946-05-26' ,'1413-03-21' ,'939-10-27' ,'1547-01-28' ,'1689-02-13' ,'1553-07-19' ,'1702-03-08' ,'955-11-23' ,'1685-02-06' ,'975-07-09' ,'1037-11-12' ,'1040-03-17' ,'1483-06-26' ,'1216-06-14' ,'1066-01-05' ,'1016-04-23' ,'1141-04-07' ,'1483-04-09' ,'1066-10-15' ,'1013-12-25' ,'1553-07-10' ,'1567-07-24' ,'1165-12-09' ,'900-01-01' ,'1329-06-07' ,'1249-07-06' ,'1058-03-17' ,'1214-12-04' ,'1406-04-04' ,'1005-03-25' ,'1513-09-09' ,'1124-04-23' ,'1460-08-03' ,'1649-01-30' ,'1660-05-29' ,'1488-06-11' ,'1542-12-14' ,'1625-03-27' ,'971-01-01' ,'1437-02-21' ,'1306-03-25' ,'1371-02-22' ,'1107-01-08' ,'1040-08-14' ,'1390-04-19' ,'862-01-01' ,'843-01-01' ,'1689-05-11' ,'1153-05-24' ,'878-01-01' ,'889-01-01' ,'943-01-01' ,'1097-01-01' ,'997-01-01' ,'954-01-01' ,'1034-11-25' ,'1689-04-11' ,'971-01-01' ,'1702-03-08' ,'962-01-01' ,'NA' ,'858-01-01' ,'1685-02-06' ,'1286-11-25' ,'1292-11-17' ,'1093-11-13' ,'1095-01-01' ,'877-01-01' ,'1057-08-15' ,'1094-05-01' ,'1166-01-01' ,'1315-06-01' ,'1258-01-01' ,'1081-01-01' ,'1195-01-01' ,'1137-01-01' ,'1170-01-01' ,'1170-01-01' ,'1253-01-01' ,'1400-09-16' ,'1246-02-25' ,'1372-05-01' ,'1240-04-12' ,'1282-12-11')
mTo <- c('2018-02-08' ,'1901-01-22' ,'1936-01-20' ,'1820-01-29' ,'1952-02-06' ,'1830-06-26' ,'1910-05-06' ,'1837-06-20' ,'1936-12-11' ,'1801-01-01' ,'1760-10-25' ,'1727-06-11' ,'1714-08-01' ,'1272-11-16' ,'1377-06-21' ,'1603-03-24' ,'1461-03-04' ,'1471-04-11' ,'1013-12-25' ,'1016-04-23' ,'1547-01-28' ,'1685-02-06' ,'1135-12-01' ,'1189-07-06' ,'1307-07-07' ,'899-10-26' ,'924-07-17' ,'1649-01-30' ,'1509-04-21' ,'1066-01-05' ,'1399-09-29' ,'1625-03-27' ,'1470-10-03' ,'1483-04-09' ,'1087-09-09' ,'1327-01-20' ,'1035-11-12' ,'1141-04-07' ,'1154-10-25' ,'1216-10-19' ,'975-07-08' ,'939-10-27' ,'1413-03-20' ,'1702-03-08' ,'1183-06-11' ,'1100-08-02' ,'1199-04-06' ,'955-11-23' ,'1422-08-31' ,'946-05-26' ,'1553-07-06' ,'1694-12-28' ,'1558-11-17' ,'1707-04-30' ,'959-10-01' ,'1688-12-11' ,'978-03-18' ,'1040-03-17' ,'1042-06-08' ,'1485-08-22' ,'1217-09-22' ,'1066-10-14' ,'1016-11-30' ,'1141-11-01' ,'1483-06-26' ,'1066-12-17' ,'1014-02-03' ,'1553-07-19' ,'1625-03-27' ,'1214-12-04' ,'943-01-01' ,'1371-02-22' ,'1286-03-19' ,'1093-11-13' ,'1249-07-06' ,'1437-02-21' ,'1034-11-25' ,'1542-12-14' ,'1153-05-24' ,'1488-06-11' ,'1651-09-03' ,'1685-02-06' ,'1513-09-09' ,'1567-07-24' ,'1649-01-30' ,'995-01-01' ,'1460-08-03' ,'1329-06-07' ,'1390-04-19' ,'1124-04-23' ,'1057-08-15' ,'1406-04-04' ,'877-01-01' ,'858-02-13' ,'1702-03-08' ,'1165-12-09' ,'889-01-01' ,'900-01-01' ,'954-01-01' ,'1107-01-08' ,'1005-03-25' ,'962-01-01' ,'1040-08-14' ,'1694-12-28' ,'977-01-01' ,'1707-04-30' ,'NA' ,'971-01-01' ,'862-04-13' ,'1688-12-11' ,'1290-09-26' ,'1296-07-10' ,'1097-01-01' ,'1097-01-01' ,'878-01-01' ,'1058-03-17' ,'1094-11-12' ,'1193-01-01' ,'1318-10-14' ,'1260-01-01' ,'1137-01-01' ,'1240-04-11' ,'1170-01-01' ,'1195-01-01' ,'1170-01-01' ,'1282-12-11' ,'1416-01-01' ,'1255-01-01' ,'1378-07-01' ,'1246-02-25' ,'1283-10-03')

# Pad short (pre-1000) years with leading zeroes so ymd() can parse them;
# "NA" strings (and missing values) become NA
padMDate <- function(x) { 
    if (is.na(x[1]) | x[1] == "NA") { 
        NA 
    } else { 
        paste0(c(str_pad(x[1], 4, pad="0"), x[2], x[3]), collapse="-") 
    } 
}



monarchs <- tibble::tibble(name=mNames, dominion=mDominion, 
                           from=ymd(sapply(str_split(mFrom, "-"), FUN=padMDate)), 
                           to=ymd(sapply(str_split(mTo, "-"), FUN=padMDate))
                           )

# Print monarchs
monarchs
## # A tibble: 131 x 4
##    name                            dominion       from       to        
##    <chr>                           <chr>          <date>     <date>    
##  1 Elizabeth II                    United Kingdom 1952-02-06 2018-02-08
##  2 Victoria                        United Kingdom 1837-06-20 1901-01-22
##  3 George V                        United Kingdom 1910-05-06 1936-01-20
##  4 George III                      United Kingdom 1801-01-01 1820-01-29
##  5 George VI                       United Kingdom 1936-12-11 1952-02-06
##  6 George IV                       United Kingdom 1820-01-29 1830-06-26
##  7 Edward VII                      United Kingdom 1901-01-22 1910-05-06
##  8 William IV                      United Kingdom 1830-06-26 1837-06-20
##  9 Edward VIII                     United Kingdom 1936-01-20 1936-12-11
## 10 George III(also United Kingdom) Great Britain  1760-10-25 1801-01-01
## # ... with 121 more rows
# Create an interval for reign
monarchs <- monarchs %>%
  mutate(reign = from %--% to) 

# Find the length of reign, and arrange
monarchs %>%
  mutate(length = int_length(reign)) %>% 
  arrange(desc(length)) %>%
  select(name, length, dominion)
## # A tibble: 131 x 3
##    name                   length dominion      
##    <chr>                   <dbl> <chr>         
##  1 Elizabeth II       2083017600 United Kingdom
##  2 Victoria           2006726400 United Kingdom
##  3 James VI           1820102400 Scotland      
##  4 Gruffudd ap Cynan  1767139200 Gwynedd       
##  5 Edward III         1590624000 England       
##  6 William I          1545868800 Scotland      
##  7 Llywelyn the Great 1428796800 Gwynedd       
##  8 Elizabeth I        1399507200 England       
##  9 Constantine II     1356912000 Scotland      
## 10 David II           1316304000 Scotland      
## # ... with 121 more rows
# Print halleys
pDate <- c('66-01-26', '141-03-25', '218-04-06', '295-04-07', '374-02-13', '451-07-03', '530-11-15', '607-03-26', '684-11-26', '760-06-10', '837-02-25', '912-07-27', '989-09-02', '1066-03-25', '1145-04-19', '1222-09-10', '1301-10-22', '1378-11-09', '1456-01-08', '1531-08-26', '1607-10-27', '1682-09-15', '1758-03-13', '1835-11-16', '1910-04-20', '1986-02-09', '2061-07-28')
sDate <- c('66-01-25', '141-03-22', '218-04-06', '295-04-07', '374-02-13', '451-06-28', '530-09-27', '607-03-15', '684-10-02', '760-05-20', '837-02-25', '912-07-18', '989-09-02', '1066-01-01', '1145-04-15', '1222-09-10', '1301-10-22', '1378-11-09', '1456-01-08', '1531-08-26', '1607-10-27', '1682-09-15', '1758-03-13', '1835-08-01', '1910-04-20', '1986-02-09', '2061-07-28')
eDate <- c('66-01-26', '141-03-25', '218-05-17', '295-04-20', '374-02-16', '451-07-03', '530-11-15', '607-03-26', '684-11-26', '760-06-10', '837-02-28', '912-07-27', '989-09-05', '1066-03-25', '1145-04-19', '1222-09-28', '1301-10-31', '1378-11-14', '1456-06-09', '1531-08-26', '1607-10-27', '1682-09-15', '1758-12-25', '1835-11-16', '1910-05-20', '1986-02-09', '2061-07-28')

halleys <- tibble::tibble(perihelion_date=ymd(sapply(str_split(pDate, "-"), FUN=padMDate)), 
                          start_date=ymd(sapply(str_split(sDate, "-"), FUN=padMDate)), 
                          end_date=ymd(sapply(str_split(eDate, "-"), FUN=padMDate))
                          )


# New column for interval from start to end date
halleys <- halleys %>%
  mutate(visible = start_date %--% end_date)

# The visitation of 1066
halleys_1066 <- halleys[14, ]

# Monarchs in power on perihelion date
monarchs %>%
  filter(halleys_1066$perihelion_date %within% reign) %>%
  select(name, from, to, dominion)
## # A tibble: 2 x 4
##   name        from       to         dominion
##   <chr>       <date>     <date>     <chr>   
## 1 Harold II   1066-01-05 1066-10-14 England 
## 2 Malcolm III 1058-03-17 1093-11-13 Scotland
# Monarchs whose reign overlaps visible time
monarchs %>%
  filter(int_overlaps(halleys_1066$visible, reign)) %>%
  select(name, from, to, dominion)
## # A tibble: 3 x 4
##   name                 from       to         dominion
##   <chr>                <date>     <date>     <chr>   
## 1 Edward the Confessor 1042-06-08 1066-01-05 England 
## 2 Harold II            1066-01-05 1066-10-14 England 
## 3 Malcolm III          1058-03-17 1093-11-13 Scotland
# New columns for duration and period
monarchs <- monarchs %>%
  mutate(
    duration = as.duration(reign),
    period = as.period(reign))

# Examine results    
monarchs %>% 
    select(name, duration, period) %>%
    head(10) %>%
    print.data.frame()
##                               name                   duration
## 1                     Elizabeth II 2083017600s (~66.01 years)
## 2                         Victoria 2006726400s (~63.59 years)
## 3                         George V  811296000s (~25.71 years)
## 4                       George III  601948800s (~19.07 years)
## 5                        George VI  478224000s (~15.15 years)
## 6                        George IV  328406400s (~10.41 years)
## 7                       Edward VII   292982400s (~9.28 years)
## 8                       William IV   220406400s (~6.98 years)
## 9                      Edward VIII   28166400s (~46.57 weeks)
## 10 George III(also United Kingdom) 1268092800s (~40.18 years)
##                 period
## 1   66y 0m 2d 0H 0M 0S
## 2   63y 7m 2d 0H 0M 0S
## 3  25y 8m 14d 0H 0M 0S
## 4  19y 0m 28d 0H 0M 0S
## 5  15y 1m 26d 0H 0M 0S
## 6  10y 4m 28d 0H 0M 0S
## 7   9y 3m 14d 0H 0M 0S
## 8  6y 11m 25d 0H 0M 0S
## 9     10m 21d 0H 0M 0S
## 10  40y 2m 7d 0H 0M 0S

Chapter 4 - Problems in Practice

Time zones - ways to keep track of times in different locations (can pose analysis challenges):

  • Typically captured as an offset from GMT, but specified in R using a tz= zone name rather than a fixed offset, since the offset from GMT can change during the year (DST, for example)
    • Sys.timezone() gives the timezone on your computer
    • OlsonNames() gives all the timezones that R is aware of
    • The names returned by OlsonNames() follow an international standard (the IANA/Olson time zone database) as to which locations are included
    • The lubridate::tz() will extract the timezone from a specific datetime
  • Can change the timezone without changing the underlying clock time components by using lubridate::force_tz()
    • force_tz(ymd_hm(“2017-12-12 12:00”, tz=“America/Los_Angeles”), tzone=“America/New_York”) will produce 2017-12-12 12:00 EST (note that the 12:00 is held, with ONLY the time-zone changed)
  • Can view the time in a different zone by using lubridate::with_tz()
    • with_tz(ymd_hm(“2017-12-12 12:00”, tz=“America/Los_Angeles”), tzone=“America/New_York”) will produce 2017-12-12 15:00 EST (note that 15:00 EST and 12:00 PST are the same moment)
  • For this entire course, if you’ve ever had a time, it’s always had an accompanying date, i.e. a datetime. But sometimes you just have a time without a date
    • If you find yourself in this situation, the hms package provides an hms class of object for holding times without dates, and the best place to start would be with as.hms()
    • readr knows the hms class, so if it comes across something that looks like a time it will use it
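As a minimal sketch of the hms class (assuming the hms package is installed; note that in newer hms releases as.hms() was renamed as_hms()):

```r
library(hms)

# Build a time-of-day with no date: hms(seconds, minutes, hours)
lunch <- hms(0, 30, 12)
lunch                    # 12:30:00

# Parse a time from an ISO-style string
as.hms("09:05:00")

# Under the hood an hms is a difftime measured in seconds
as.numeric(lunch)        # 45000
```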

Importing and exporting datetimes:

  • The parse_date_time() function is designed to be forgiving and flexible, but at the expense of being slow (since it considers many possible formats)
    • The fasttime::fastPOSIXct() is designed to very quickly read datetimes in a proper ISO format such as “YYYY-MM-DD HH:MM:SS”
    • The lubridate::fast_strptime(x=, format=) is also fast, but it requires a valid strptime format like “%Y-%m-%d” rather than the more flexible/forgiving parse_date_time(x=, order=“ymd”)
    • See help for strptime() for the valid strings
  • The readr::write_csv() will write datetime objects in a proper ISO format, making for easy read-in
  • Can also use the lubridate::stamp() capability to build a function that will format things based on an example you provide
    • my_stamp <- stamp(“Tuesday October 10 2017”)
    • my_stamp has been created by lubridate::stamp() as function(x) format(x, format=“%A %B %d %Y”) to match the example given

Wrap-up:

  • Chapter 1: base R objects Date, POSIXct
    • lubridate, zoo, xts, and the like all work together with each other and these
  • Chapter 2: importing and manipulating datetime objects
  • Chapter 3: challenges of arithmetic with datetimes
    • periods, durations, intervals
  • Chapter 4: time zones, and import/outputs

Example code includes:

# Game2: CAN vs NZL in Edmonton
game2 <- mdy_hm("June 11 2015 19:00")

# Game3: CHN vs NZL in Winnipeg
game3 <- mdy_hm("June 15 2015 18:30")

# Set the timezone to "America/Edmonton"
game2_local <- force_tz(game2, tzone = "America/Edmonton")
game2_local
## [1] "2015-06-11 19:00:00 MDT"
# Set the timezone to "America/Winnipeg"
game3_local <- force_tz(game3, tzone = "America/Winnipeg")
game3_local
## [1] "2015-06-15 18:30:00 CDT"
# How long does the team have to rest?
as.period(game2_local %--% game3_local)
## [1] "3d 22H 30M 0S"
# What time is game2_local in NZ?
with_tz(game2_local, tzone = "Pacific/Auckland")
## [1] "2015-06-12 13:00:00 NZST"
# What time is game2_local in Corvallis, Oregon?
with_tz(game2_local, tzone = "America/Los_Angeles")
## [1] "2015-06-11 18:00:00 PDT"
# What time is game3_local in NZ?
with_tz(game3_local, tzone = "Pacific/Auckland")
## [1] "2015-06-16 11:30:00 NZST"
# Examine datetime and date_utc columns
head(akl_hourly$datetime)
## [1] "2016-01-01 00:00:00 UTC" "2016-01-01 00:30:00 UTC"
## [3] "2016-01-01 01:00:00 UTC" "2016-01-01 01:30:00 UTC"
## [5] "2016-01-01 02:00:00 UTC" "2016-01-01 02:30:00 UTC"
head(akl_hourly$date_utc)
## [1] "2015-12-31 11:00:00 UTC" "2015-12-31 11:30:00 UTC"
## [3] "2015-12-31 12:00:00 UTC" "2015-12-31 12:30:00 UTC"
## [5] "2015-12-31 13:00:00 UTC" "2015-12-31 13:30:00 UTC"
# Force datetime to Pacific/Auckland
akl_hourly <- akl_hourly %>%
  mutate(
    datetime = force_tz(datetime, tzone = "Pacific/Auckland"))

# Reexamine datetime
head(akl_hourly$datetime)
## [1] "2016-01-01 00:00:00 NZDT" "2016-01-01 00:30:00 NZDT"
## [3] "2016-01-01 01:00:00 NZDT" "2016-01-01 01:30:00 NZDT"
## [5] "2016-01-01 02:00:00 NZDT" "2016-01-01 02:30:00 NZDT"
# Are datetime and date_utc the same moments
table(akl_hourly$datetime - akl_hourly$date_utc)
## 
## -82800      0   3600 
##      2  17450      2
# Import auckland hourly data 
akl_hourly <- read_csv("./RInputFiles/akl_weather_hourly_2016.csv")
## Parsed with column specification:
## cols(
##   year = col_integer(),
##   month = col_integer(),
##   mday = col_integer(),
##   time = col_time(format = ""),
##   temperature = col_double(),
##   weather = col_character(),
##   conditions = col_character(),
##   events = col_character(),
##   humidity = col_integer(),
##   date_utc = col_datetime(format = "")
## )
# Examine structure of time column
str(akl_hourly$time)
## Classes 'hms', 'difftime'  atomic [1:17454] 0 1800 3600 5400 7200 9000 10800 12600 14400 16200 ...
##   ..- attr(*, "units")= chr "secs"
# Examine head of time column
head(akl_hourly$time)
## 00:00:00
## 00:30:00
## 01:00:00
## 01:30:00
## 02:00:00
## 02:30:00
# A plot using just time
ggplot(akl_hourly, aes(x = time, y = temperature)) +
  geom_line(aes(group = make_date(year, month, mday)), alpha = 0.2)

library(microbenchmark)
library(fasttime)

# Examine structure of dates
dates <- paste0(gsub(" ", "T", as.character(akl_hourly$date_utc)), "Z")

str(dates)
##  chr [1:17454] "2015-12-31T11:00:00Z" "2015-12-31T11:30:00Z" ...
# Use fastPOSIXct() to parse dates
fastPOSIXct(dates) %>% str()
##  POSIXct[1:17454], format: "2015-12-31 05:00:00" "2015-12-31 05:30:00" ...
# Compare speed of fastPOSIXct() to ymd_hms()
microbenchmark(
  ymd_hms = ymd_hms(dates),
  fasttime = fastPOSIXct(dates),
  times = 20)
## Unit: milliseconds
##      expr      min        lq      mean    median        uq       max neval
##   ymd_hms 19.65648 25.450991 31.008432 29.757100 35.059554 54.315367    20
##  fasttime  1.54544  1.771433  2.623122  1.866963  2.416451  7.597718    20
##  cld
##    b
##   a
# Head of dates
head(dates)
## [1] "2015-12-31T11:00:00Z" "2015-12-31T11:30:00Z" "2015-12-31T12:00:00Z"
## [4] "2015-12-31T12:30:00Z" "2015-12-31T13:00:00Z" "2015-12-31T13:30:00Z"
# Parse dates with fast_strptime
fast_strptime(dates, 
    format = "%Y-%m-%dT%H:%M:%SZ") %>% str()
##  POSIXlt[1:17454], format: "2015-12-31 11:00:00" "2015-12-31 11:30:00" ...
# Compare speed to ymd_hms() and fasttime
microbenchmark(
  ymd_hms = ymd_hms(dates),
  fasttime = fastPOSIXct(dates),
  fast_strptime = fast_strptime(dates, 
    format = "%Y-%m-%dT%H:%M:%SZ"),
  times = 20)
## Unit: milliseconds
##           expr       min        lq      mean    median        uq
##        ymd_hms 19.044623 23.569423 33.620204 27.006693 30.107636
##       fasttime  1.541492  1.677286  1.910483  1.744392  1.891042
##  fast_strptime  1.289248  1.391093  2.314567  1.466490  2.148220
##         max neval cld
##  162.411612    20   b
##    3.482074    20  a 
##    7.368369    20  a
finished <- "I finished 'Dates and Times in R' on Thursday, September 20, 2017!"
# Create a stamp based on "Sep 20 2017"
date_stamp <- stamp("September 20, 2017", orders="mdy")
## Multiple formats matched: "%Om %d, %Y"(1), "%B %d, %Y"(1)
## Using: "%B %d, %Y"
# Print date_stamp
date_stamp
## function (x, locale = "English_United States.1252") 
## {
##     {
##         old_lc_time <- Sys.getlocale("LC_TIME")
##         if (old_lc_time != locale) {
##             Sys.setlocale("LC_TIME", locale)
##             on.exit(Sys.setlocale("LC_TIME", old_lc_time))
##         }
##     }
##     format(x, format = "%B %d, %Y")
## }
## <environment: 0x0000000011784ba0>
# Call date_stamp on today()
date_stamp(today())
## [1] "March 18, 2018"
# Create and call a stamp based on "09/20/2017"
stamp("09/20/2017", orders="mdy")(today())
## Multiple formats matched: "%Om/%d/%Y"(1), "%m/%d/%Y"(1)
## Using: "%Om/%d/%Y"
## [1] "03/18/2018"
# Use string finished for stamp()
stamp(finished, orders="amdy")(today())
## Multiple formats matched: "I finished 'Dates and Times in R' on %A, %B %d, %Y!"(1), "I finished 'Dates and Times in R' on %A, %Om %d, %Y!"(0)
## Using: "I finished 'Dates and Times in R' on %A, %B %d, %Y!"
## [1] "I finished 'Dates and Times in R' on Sunday, March 18, 2018!"

Scalable Data Processing in R

Chapter 1 - Working with Increasingly Large Data Sets

What is scalable data processing?:

  • Working with data that is too large for one computer
  • Scalable code lets you work in parallel, and use resources as they become available
  • Data sets are frequently much bigger than available RAM, which is a challenge since R by default keeps all of its objects in RAM
    • “R is not well suited to working with data larger than 10%-20% of a computer’s RAM” - The R Installation and Administration Manual
    • When a computer runs out of RAM, it “swaps” to the hard drive, vastly slowing down the calculations
  • A more scalable solution is as follows
    • Move a subset of data in to RAM
    • Process the subset
    • Keep the results and discard the subset
  • Code may be slow due to complexity of calculations
    • Consider the disk operations needed
  • Benchmarking using microbenchmark() can be critical

Working with “out of core” objects using the Bigmemory Project:

  • Package “bigmemory” was written by Kane (instructor for this course) to store, manipulate, and process matrices exceeding RAM
    • Core object is a big.matrix and it is designed to manage situations where disk space is much greater than RAM
    • The process of moving data to RAM only when needed is called “out of core” processing
  • By default, a big.matrix keeps data on the disk, only moving the data to RAM as needed
    • The movements to/from RAM are implicit, which is to say that they are managed by the package
    • Only a single import is needed
  • The big.matrix is created using big.matrix(nrow=, ncol=, init=, type=, backingfile=, descriptorfile=)
    • The nrow, ncol are the same as matrix(), while init is the initial value to assign everywhere and type is a quoted type such as “double” or “integer”
    • The backingfile is a quoted file name that will hold the binary representation of the big.matrix on the disk, with extension .bin
    • The descriptorfile is a quoted file name that will hold some metadata such as the number of rows/columns, name, and the like
  • Supposing that x is a big.matrix, then the default print(x), obtained by typing x on the command line, shows a few slots/pointers rather than the contents
    • To have contents of x printed, use x[ , ]
    • Assignments can be made using x[myRow, myColumn] <- myValue
  • The read.big.matrix() function is meant to look similar to read.table() but, in addition, needs to know:
    • what type of numeric values you want to read (“char”, “short”, “integer”, “double”)
    • name of the file that will hold the matrix’s data (the backing file)
    • name of the file to hold information about the matrix (a descriptor file)
    • Result will be a file on the disk holding the value read in along with a descriptor file which holds extra information (like the number of columns and rows) about the resulting big.matrix object
  • A final advantage to using big.matrix is that if you know how to use R’s matrices, then you know how to use a big.matrix
    • You can subset columns and rows just as you would a regular matrix, using a numeric or character vector and the object returned is an R matrix
    • Likewise, assignments are the same as with R matrices and after those assignments are made they are stored on disk and can be used in the current and future R sessions
    • One thing to remember is that $ is not valid for getting a column of either a matrix or a big.matrix
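A minimal sketch of creating and touching a big.matrix directly (the file names here are illustrative):

```r
library(bigmemory)

# A 3 x 2 integer big.matrix, initialized to 0 and backed by files on disk
x <- big.matrix(nrow = 3, ncol = 2, init = 0L, type = "integer",
                backingfile = "demo.bin", descriptorfile = "demo.desc")

x                 # default print shows slots/pointers, not the data
x[ , ]            # subsetting with [ , ] returns an ordinary R matrix
x[1, 2] <- 42L    # assignment writes through to the backing file
x[1, ]
```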

References vs. Copies:

  • Can subset and make assignments to a big.matrix much like a matrix
  • There are a few key differences between a big.matrix and a matrix
    • big.matrix is stored on the disk (persists across R sessions, can be shared across R sessions)
    • R typically makes copies during assignment, which is why changing a variable inside a function (playing with the copy) has no impact on the variable outside the function
    • However, some objects such as environments are not copied, so modifying them inside a function modifies them globally (outside the function) as well
    • The big.matrix is not copied, and is instead a reference object; thus, you have to explicitly request a copy, which means 1) you have more control, but 2) you need to be more careful
  • The reference vs. copy for big.matrix objects seems in some ways similar to Python
    • a <- b will set a to reference the same data as b; a change made through either name is seen through both
    • b <- deepcopy(a) will produce an independent copy of a and assign it to b; much like b = a[:] in Python

Example code includes:

# Load the microbenchmark package
library(microbenchmark)

# Compare the timings for sorting different sizes of vector
mb <- microbenchmark(
  # Sort a random normal vector length 1e5
  "1e5" = sort(rnorm(1e5)),
  # Sort a random normal vector length 2.5e5
  "2.5e5" = sort(rnorm(2.5e5)),
  # Sort a random normal vector length 5e5
  "5e5" = sort(rnorm(5e5)),
  "7.5e5" = sort(rnorm(7.5e5)),
  "1e6" = sort(rnorm(1e6)),
  times = 10
)

# Plot the resulting benchmark object
plot(mb)

# Load the bigmemory package
library(bigmemory)

# Create the big.matrix object: x
x <- read.big.matrix("./RInputFiles/mortgage-sample.csv", header = TRUE, 
                     type = "integer", 
                     backingfile = "mortgage-sample.bin", 
                     descriptorfile = "mortgage-sample.desc")
    
# Find the dimensions of x
dim(x)
## [1] 70000    16
# Attach mortgage-sample.desc
mort <- attach.big.matrix("mortgage-sample.desc")

# Find the dimensions of mort
dim(mort)
## [1] 70000    16
# Look at the first 6 rows of mort
head(mort)
##      enterprise record_number msa perc_minority tract_income_ratio
## [1,]          1           566   1             1                  3
## [2,]          1           116   1             3                  2
## [3,]          1           239   1             2                  2
## [4,]          1            62   1             2                  3
## [5,]          1           106   1             2                  3
## [6,]          1           759   1             3                  3
##      borrower_income_ratio loan_purpose federal_guarantee borrower_race
## [1,]                     1            2                 4             3
## [2,]                     1            2                 4             5
## [3,]                     3            8                 4             5
## [4,]                     3            2                 4             5
## [5,]                     3            2                 4             9
## [6,]                     2            2                 4             9
##      co_borrower_race borrower_gender co_borrower_gender num_units
## [1,]                9               2                  4         1
## [2,]                9               1                  4         1
## [3,]                5               1                  2         1
## [4,]                9               2                  4         1
## [5,]                9               3                  4         1
## [6,]                9               1                  2         2
##      affordability year type
## [1,]             3 2010    1
## [2,]             3 2008    1
## [3,]             4 2014    0
## [4,]             4 2009    1
## [5,]             4 2013    1
## [6,]             4 2010    1
# Create mort
mort <- attach.big.matrix("mortgage-sample.desc")

# Look at the first 3 rows
mort[1:3, ]
##      enterprise record_number msa perc_minority tract_income_ratio
## [1,]          1           566   1             1                  3
## [2,]          1           116   1             3                  2
## [3,]          1           239   1             2                  2
##      borrower_income_ratio loan_purpose federal_guarantee borrower_race
## [1,]                     1            2                 4             3
## [2,]                     1            2                 4             5
## [3,]                     3            8                 4             5
##      co_borrower_race borrower_gender co_borrower_gender num_units
## [1,]                9               2                  4         1
## [2,]                9               1                  4         1
## [3,]                5               1                  2         1
##      affordability year type
## [1,]             3 2010    1
## [2,]             3 2008    1
## [3,]             4 2014    0
# Create a table of the number of mortgages for each year in the data set
table(mort[, "year"])
## 
##  2008  2009  2010  2011  2012  2013  2014  2015 
##  8468 11101  8836  7996 10935 10216  5714  6734
a <- getLoadedDLLs()
length(a)
## [1] 39
R.utils::gcDLLs()
## named list()
a <- getLoadedDLLs()
length(a)
## [1] 39
# Load the biganalytics package (error in loading to Knit file, works OK otherwise)
library(biganalytics)
## Loading required package: foreach
## Loading required package: biglm
## Loading required package: DBI
# Get the column means of mort
colmean(mort)
##            enterprise         record_number                   msa 
##             1.3814571           499.9080571             0.8943571 
##         perc_minority    tract_income_ratio borrower_income_ratio 
##             1.9701857             2.3431571             2.6898857 
##          loan_purpose     federal_guarantee         borrower_race 
##             3.7670143             3.9840857             5.3572429 
##      co_borrower_race       borrower_gender    co_borrower_gender 
##             7.0002714             1.4590714             3.0494857 
##             num_units         affordability                  year 
##             1.0398143             4.2863429          2011.2714714 
##                  type 
##             0.5300429
# Use biganalytics' summary function to get a summary of the data
summary(mort)
##                                min          max         mean          NAs
## enterprise               1.0000000    2.0000000    1.3814571    0.0000000
## record_number            0.0000000  999.0000000  499.9080571    0.0000000
## msa                      0.0000000    1.0000000    0.8943571    0.0000000
## perc_minority            1.0000000    9.0000000    1.9701857    0.0000000
## tract_income_ratio       1.0000000    9.0000000    2.3431571    0.0000000
## borrower_income_ratio    1.0000000    9.0000000    2.6898857    0.0000000
## loan_purpose             1.0000000    9.0000000    3.7670143    0.0000000
## federal_guarantee        1.0000000    4.0000000    3.9840857    0.0000000
## borrower_race            1.0000000    9.0000000    5.3572429    0.0000000
## co_borrower_race         1.0000000    9.0000000    7.0002714    0.0000000
## borrower_gender          1.0000000    9.0000000    1.4590714    0.0000000
## co_borrower_gender       1.0000000    9.0000000    3.0494857    0.0000000
## num_units                1.0000000    4.0000000    1.0398143    0.0000000
## affordability            0.0000000    9.0000000    4.2863429    0.0000000
## year                  2008.0000000 2015.0000000 2011.2714714    0.0000000
## type                     0.0000000    1.0000000    0.5300429    0.0000000
# Use deepcopy() to create first_three
first_three <- deepcopy(mort, cols = 1:3, 
                        backingfile = "first_three.bin", 
                        descriptorfile = "first_three.desc")

# Set first_three_2 equal to first_three
first_three_2 <- first_three

# Set the value in the first row and first column of first_three to NA
first_three[1, 1] <- NA

# Verify the change shows up in first_three_2
first_three_2[1, 1]
## [1] NA
# but not in mort
mort[1, 1]
## [1] 1

Chapter 2 - Processing and Analyzing Data with bigmemory

The Bigmemory Suite of Packages:

  • Many packages have been designed to work together with a big.matrix object
    • biganalytics - summarizing
    • bigtabulate - split and tabulate (includes the bigtable(x, quotedColumnVector))
    • bigalgebra - linear algebra
    • bigpca - PCA
    • bigFastLM - linear regressions
    • biglasso - lasso regressions
    • bigrf - random forests
  • FHFA Dataset has data about millions of mortgages - differences in ownership rates, defaults, etc.
    • Course will use a 70,000 record subset
    • Raw data (full 2.5 GB dataset) available at FHFA (fhfa.gov)
    • Code works the same on subsets and full data sets

Split-Apply-Combine (aka Split-Compute-Combine), run in this course using split() Map() Reduce():

  • The split() function partitions the data, whether randomly or based on a factor variable
    • split(myData, myFactor) will produce a list, with each element of the list containing the requested data (one per myFactor)
  • The Map() function processes each of the partitions
    • Map(myFunction, mySplitList) will apply the myFunction to each of the items in the mySplitList, with the output a list named like mySplitList
  • The Reduce() function combines the (typically processed) data from a list
    • Reduce(myFunction, myMapList) will apply the myFunction while combining the items in myMapList
    • A common function might be rbind or ‘+’ (add them up)
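The pattern can be sketched with base R alone, here on the built-in mtcars data rather than the course's mortgage data:

```r
# Split: one mpg vector per cylinder count (the factor variable)
parts <- split(mtcars$mpg, mtcars$cyl)

# Apply: process each partition independently (could be done in parallel)
part_sums <- Map(sum, parts)
part_ns   <- Map(length, parts)

# Combine: reduce the per-partition results into a single answer
overall_mean <- Reduce(`+`, part_sums) / Reduce(`+`, part_ns)
overall_mean    # equals mean(mtcars$mpg)
```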

Visualize results using tidyverse:

  • The pipe (%>%) operator works well with many of the big.matrix functions, since the first argument is a dataset
  • Can combine some of big.matrix processing outputs with standard packages like dplyr and tidyr and ggplot
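For example (a sketch, assuming mort is the attached big.matrix from earlier), a bigtable() result can be piped straight into tidyverse verbs and on into ggplot:

```r
library(bigtabulate)
library(dplyr)
library(ggplot2)

# mort is the first argument to bigtable(), so the pipe slots it in directly
mort %>%
  bigtable("year") %>%
  as.data.frame() %>%
  setNames(c("year", "count")) %>%
  ggplot(aes(x = year, y = count)) +
  geom_col()
```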

Limitations of bigmemory - process is useful for dense, numeric matrices that can be stored on hard disk:

  • Underlying structures are compatible with low-level linear algebra libraries for fast fitting
  • If you have different column types, you can try the ff package (similar to bigmemory but includes structures like a data.frame)
  • The bigmemory object is said to be “random access”, which means it is equally easy to get access to any specific component
  • There are some big drawbacks to the “random access” capabilities, however
    • Cannot add rows or columns - need to create an entirely new object and port over the relevant data
    • Need enough disk space to hold the entire matrix in a block
    • Can instead use other tools to process data using a “continuous chunks” approach - discussed in the next chapter

Example code includes:

library(bigtabulate)
library(tidyr)
library(ggplot2)
library(biganalytics)
library(dplyr)


race_cat <- c('Native Am', 'Asian', 'Black', 'Pacific Is', 'White', 'Two or More', 'Hispanic', 'Not Avail')

# Call bigtable to create a variable called race_table
race_table <- bigtable(mort, "borrower_race")

# Rename the elements of race_table
names(race_table) <- race_cat
race_table
##   Native Am       Asian       Black  Pacific Is       White Two or More 
##         143        4438        2020         195       50006         528 
##    Hispanic   Not Avail 
##        4040        8630
# Create a table of the borrower race by year
race_year_table <- bigtable(mort, c("borrower_race", "year"))

# Convert race_year_table to a data frame
rydf <- as.data.frame(race_year_table)

# Create the new column Race
rydf$Race <- race_cat

# Let's see what it looks like
rydf
##   2008 2009 2010 2011 2012 2013 2014 2015        Race
## 1   11   18   13   16   15   12   29   29   Native Am
## 2  384  583  603  568  770  673  369  488       Asian
## 3  363  320  209  204  258  312  185  169       Black
## 4   33   38   21   13   28   22   17   23  Pacific Is
## 5 5552 7739 6301 5746 8192 7535 4110 4831       White
## 6   43   85   65   58   89   78   46   64 Two or More
## 7  577  563  384  378  574  613  439  512    Hispanic
## 9 1505 1755 1240 1013 1009  971  519  618   Not Avail
female_residence_prop <- function(x, rows) {
    x_subset <- x[rows, ]
    # Find the proportion of female borrowers in urban areas
    prop_female_urban <- sum(x_subset[, "borrower_gender"] == 2 & 
                                 x_subset[, "msa"] == 1) / 
        sum(x_subset[, "msa"] == 1)
    # Find the proportion of female borrowers in rural areas
    prop_female_rural <- sum(x_subset[, "borrower_gender"] == 2 & 
                                 x_subset[, "msa"] == 0) / 
        sum(x_subset[, "msa"] == 0)
    
    c(prop_female_urban, prop_female_rural)
}

# Find the proportion of female borrowers in 2015
female_residence_prop(mort, mort[, "year"] == 2015)
## [1] 0.2737439 0.2304965
# Split the row numbers of the mortgage data by year
spl <- split(1:nrow(mort), mort[, "year"])

# Call str on spl
str(spl)
## List of 8
##  $ 2008: int [1:8468] 2 8 15 17 18 28 35 40 42 47 ...
##  $ 2009: int [1:11101] 4 13 25 31 43 49 52 56 67 68 ...
##  $ 2010: int [1:8836] 1 6 7 10 21 23 24 27 29 38 ...
##  $ 2011: int [1:7996] 11 20 37 46 53 57 73 83 86 87 ...
##  $ 2012: int [1:10935] 14 16 26 30 32 33 48 69 81 94 ...
##  $ 2013: int [1:10216] 5 9 19 22 36 44 55 58 72 74 ...
##  $ 2014: int [1:5714] 3 12 50 60 64 66 103 114 122 130 ...
##  $ 2015: int [1:6734] 34 41 54 61 62 65 82 91 102 135 ...
# For each of the row splits, find the female residence proportion
all_years <- Map(function(rows) female_residence_prop(mort, rows), spl)

# Call str on all_years
str(all_years)
## List of 8
##  $ 2008: num [1:2] 0.275 0.204
##  $ 2009: num [1:2] 0.244 0.2
##  $ 2010: num [1:2] 0.241 0.201
##  $ 2011: num [1:2] 0.252 0.241
##  $ 2012: num [1:2] 0.244 0.21
##  $ 2013: num [1:2] 0.275 0.257
##  $ 2014: num [1:2] 0.289 0.268
##  $ 2015: num [1:2] 0.274 0.23
# Collect the results as rows in a matrix
prop_female <- Reduce(rbind, all_years)

# Set the row and column names
dimnames(prop_female) <- list(names(all_years), c("prop_female_urban", "prop_female_rural"))

# View the matrix
prop_female
##      prop_female_urban prop_female_rural
## 2008         0.2748514        0.2039474
## 2009         0.2441074        0.2002978
## 2010         0.2413881        0.2014028
## 2011         0.2520644        0.2408931
## 2012         0.2438950        0.2101313
## 2013         0.2751059        0.2567164
## 2014         0.2886756        0.2678571
## 2015         0.2737439        0.2304965
# Convert prop_female to a data frame
prop_female_df <- as.data.frame(prop_female)

# Add a new column Year
prop_female_df$Year <- row.names(prop_female_df)

# Call gather on prop_female_df
prop_female_long <- gather(prop_female_df, Region, Prop, -Year)

# Create a line plot
ggplot(prop_female_long, aes(x = Year, y = Prop, group = Region, color = Region)) + 
    geom_line()

# Call summary on mort
summary(mort)
##                                min          max         mean          NAs
## enterprise               1.0000000    2.0000000    1.3814571    0.0000000
## record_number            0.0000000  999.0000000  499.9080571    0.0000000
## msa                      0.0000000    1.0000000    0.8943571    0.0000000
## perc_minority            1.0000000    9.0000000    1.9701857    0.0000000
## tract_income_ratio       1.0000000    9.0000000    2.3431571    0.0000000
## borrower_income_ratio    1.0000000    9.0000000    2.6898857    0.0000000
## loan_purpose             1.0000000    9.0000000    3.7670143    0.0000000
## federal_guarantee        1.0000000    4.0000000    3.9840857    0.0000000
## borrower_race            1.0000000    9.0000000    5.3572429    0.0000000
## co_borrower_race         1.0000000    9.0000000    7.0002714    0.0000000
## borrower_gender          1.0000000    9.0000000    1.4590714    0.0000000
## co_borrower_gender       1.0000000    9.0000000    3.0494857    0.0000000
## num_units                1.0000000    4.0000000    1.0398143    0.0000000
## affordability            0.0000000    9.0000000    4.2863429    0.0000000
## year                  2008.0000000 2015.0000000 2011.2714714    0.0000000
## type                     0.0000000    1.0000000    0.5300429    0.0000000
bir_df_wide <- bigtable(mort, c("borrower_income_ratio", "year")) %>% 
    as.data.frame() %>% 
    tibble::rownames_to_column() %>% 
    filter(rowname %in% c(1, 2, 3)) %>% 
    select(-rowname) %>%
    # Create a new column called BIR with the corresponding table categories
    mutate(BIR = c(">=0,<=50%", ">50, <=80%", ">80%"))

bir_df_wide
##   2008 2009 2010 2011 2012 2013 2014 2015        BIR
## 1 1205 1473  600  620  745  725  401  380  >=0,<=50%
## 2 2095 2791 1554 1421 1819 1861 1032 1145 >50, <=80%
## 3 4844 6707 6609 5934 8338 7559 4255 5169       >80%
bir_df_wide %>% 
    # Transform the wide-formatted data.frame into the long format
    gather(Year, Count, -BIR) %>%
    # Use ggplot to create a line plot
    ggplot(aes(x = Year, y = Count, group = BIR, color = BIR)) + 
    geom_line()


Chapter 3 - Working with iotools

Introduction to chunk-wise processing - solution to challenges from bigmemory:

  • The iotools package allows for processing the data in “chunks”, allowing for data frames, data across many machines, and the like
  • Can process chunks either sequentially (keep as needed after each chunk runs) or independently
    • Independent processing is typically harder to code (final result must be combined), but allows for parallel processing
  • Sometimes Split-Apply-Combine cannot be made to work, such as trying to find a median (keeping each chunk’s sum and count works when the end goal is the mean, but no comparable per-chunk summary recovers the exact median)
    • Fortunately, most regressions can be successfully run using the Split-Apply-Combine methodology
  • An operation that gives the same answer whether you apply it to an entire data set, or to chunks of the data set and then to the results from those chunks, is sometimes called foldable
    • The max() and min() operations are an example of this
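A quick base-R check of foldability: max() gives the same answer computed chunk-wise, while median() generally does not.

```r
# Chunk a vector in two, apply an operation per chunk, then fold the results
x <- c(5, 1, 9, 4, 7, 2, 8, 3)
chunks <- split(x, rep(1:2, each = 4))

# max is foldable: the max of the chunk maxima equals the overall max
chunked_max <- max(sapply(chunks, max))

# median is not: the median of chunk medians can differ from the true median
chunked_median <- median(sapply(chunks, median))

c(chunked_max == max(x), chunked_median == median(x))   # TRUE FALSE
```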

First look at iotools: Importing data:

  • Basic components of chunk-wise processing include 1) load pieces of data, 2) convert to native objects, 3) perform computation and store results, and 4) repeat as needed until finished
  • Loading data often takes more time than processing the data (retrieval from disk and conversion to readable formats)
  • The iotools package is designed to separate the physical loading of data and the parsing of data into R objects for better flexibility and performance
    • readAsRaw() reads the entire data set into a raw vector
    • read.chunk() reads the data in chunks into a raw vector
  • The iotools package can then parse the data into either a matrix or a data frame
    • mstrsplit() converts raw data into a matrix
    • dstrsplit() converts raw data into a data frame
    • read.delim.raw() = readAsRaw() + dstrsplit()
  • Processing contiguous chunks means there is no need to read all the data in advance (unlike creating the spl vector by splitting 1:nrow by myVar)
  • When processing a sequence of contiguous chunks of data on a hard drive, iotools can turn a raw object into a data.frame or matrix while - at the same time - retrieving the next chunk of data
    • These optimizations allow iotools to quickly process very large files

Using chunk.apply - effectively moves away from what is functionally a “for loop” to allow better parallel processing:

  • iotools is the basis of hmr which allows for running R on TB of data using Hadoop
  • The general usage is chunk.apply(myFile, myFunction, CH.MAX.SIZE = ...), which applies myFunction to successive chunks of at most CH.MAX.SIZE bytes read from myFile
    • Output will be a matrix where each row is one of the chunks and each column is one of the (or the only) outputs from myFunction for that chunk
    • There is an optional parallel= option; the argument supplied is the number of parallel processes to be used
  • By default, chunk.apply() aggregates the processed data using the rbind() function
    • This means that you can create a table from each of the chunks and then add up the rows of the resulting matrix to get the total counts for the table
    • When the parallel parameter is set to a value greater than one on Linux and Unix machines (including the Mac), multiple processes read and process data at the same time, reducing the execution time
    • On Windows the parallel parameter is ignored

Example code includes:

foldable_range <- function(x) {
  if (is.list(x)) {
    # If x is a list then reduce it by the min and max of each element in the list
    c(Reduce(min, x), Reduce(max, x))
  } else {
    # Otherwise, assume it's a vector and find its range
    range(x)
  }
}

# Verify that foldable_range() works on the record_number column
foldable_range(mort[, "record_number"])
## [1]   0 999
# Split the mortgage data by year
spl <- split(1:nrow(mort), mort[, "year"])

# Use foldable_range() to get the range of the record numbers
foldable_range(Map(function(s) foldable_range(mort[s, "record_number"]), spl))
## [1]   0 999
# Load the iotools and microbenchmark packages
library(iotools)
library(microbenchmark)

# Time the reading of files
microbenchmark(
    # Time the reading of a file using read.delim five times
    read.delim("./RInputFiles/mortgage-sample.csv", header = FALSE, sep = ","),
    # Time the reading of a file using read.delim.raw five times
    read.delim.raw("./RInputFiles/mortgage-sample.csv", header = FALSE, sep = ","),
    times = 5
)
## Unit: milliseconds
##                                                                                 expr
##      read.delim("./RInputFiles/mortgage-sample.csv", header = FALSE,      sep = ",")
##  read.delim.raw("./RInputFiles/mortgage-sample.csv", header = FALSE,      sep = ",")
##       min       lq      mean    median        uq       max neval cld
##  273.9307 274.5402 307.78187 284.07930 302.93168 403.42744     5   b
##   60.5291  65.4149  70.97342  67.44982  73.48039  87.99291     5  a
# Read mortgage-sample.csv as a raw vector
raw_file_content <- readAsRaw("./RInputFiles/mortgage-sample.csv")

# Convert the raw vector contents to a matrix
mort_mat <- mstrsplit(raw_file_content, sep = ",", type = "integer", skip = 1)

# Look at the first 6 rows
head(mort_mat)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]    1  566    1    1    3    1    2    4    3     9     2     4     1
## [2,]    1  116    1    3    2    1    2    4    5     9     1     4     1
## [3,]    1  239    1    2    2    3    8    4    5     5     1     2     1
## [4,]    1   62    1    2    3    3    2    4    5     9     2     4     1
## [5,]    1  106    1    2    3    3    2    4    9     9     3     4     1
## [6,]    1  759    1    3    3    2    2    4    9     9     1     2     2
##      [,14] [,15] [,16]
## [1,]     3  2010     1
## [2,]     3  2008     1
## [3,]     4  2014     0
## [4,]     4  2009     1
## [5,]     4  2013     1
## [6,]     4  2010     1
# Convert the raw file contents to a data.frame
mort_df <- dstrsplit(raw_file_content, sep = ",", col_types = rep("integer", 16), skip = 1)

# Look at the first 6 rows
head(mort_df)
##   V1  V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14  V15 V16
## 1  1 566  1  1  3  1  2  4  3   9   2   4   1   3 2010   1
## 2  1 116  1  3  2  1  2  4  5   9   1   4   1   3 2008   1
## 3  1 239  1  2  2  3  8  4  5   5   1   2   1   4 2014   0
## 4  1  62  1  2  3  3  2  4  5   9   2   4   1   4 2009   1
## 5  1 106  1  2  3  3  2  4  9   9   3   4   1   4 2013   1
## 6  1 759  1  3  3  2  2  4  9   9   1   2   2   4 2010   1
# We have created a file connection fc to the "mortgage-sample.csv" file and read in the first line to get rid of the header.
# Define the function to apply to each chunk
make_table <- function(chunk) {
    # Read each chunk as a matrix
    x <- mstrsplit(chunk, type = "integer", sep = ",")
    # Create a table of the number of borrowers (column 3) for each chunk
    table(x[, 3])
}

# Create a file connection to mortgage-sample.csv
fc <- file("./RInputFiles/mortgage-sample.csv", "rb")

# Read the first line to get rid of the header
(col_names <- readLines(fc, n = 1))
## [1] "\"enterprise\",\"record_number\",\"msa\",\"perc_minority\",\"tract_income_ratio\",\"borrower_income_ratio\",\"loan_purpose\",\"federal_guarantee\",\"borrower_race\",\"co_borrower_race\",\"borrower_gender\",\"co_borrower_gender\",\"num_units\",\"affordability\",\"year\",\"type\""
(col_names <- lapply(stringr::str_split(col_names, '\\",\\"'), FUN = function(x) { stringr::str_replace(x, '\\"', '') })[[1]])
##  [1] "enterprise"            "record_number"        
##  [3] "msa"                   "perc_minority"        
##  [5] "tract_income_ratio"    "borrower_income_ratio"
##  [7] "loan_purpose"          "federal_guarantee"    
##  [9] "borrower_race"         "co_borrower_race"     
## [11] "borrower_gender"       "co_borrower_gender"   
## [13] "num_units"             "affordability"        
## [15] "year"                  "type"
# Read the data in chunks
counts <- chunk.apply(fc, make_table, CH.MAX.SIZE = 1e5)

# Close the file connection
close(fc)

# Print counts
counts
##         0    1
##  [1,] 309 2401
##  [2,] 289 2422
##  [3,] 266 2444
##  [4,] 300 2410
##  [5,] 279 2431
##  [6,] 310 2400
##  [7,] 274 2436
##  [8,] 283 2428
##  [9,] 259 2452
## [10,] 287 2423
## [11,] 288 2423
## [12,] 283 2428
## [13,] 271 2439
## [14,] 299 2411
## [15,] 294 2416
## [16,] 305 2405
## [17,] 280 2431
## [18,] 275 2435
## [19,] 303 2407
## [20,] 279 2431
## [21,] 296 2414
## [22,] 294 2417
## [23,] 288 2424
## [24,] 264 2446
## [25,] 292 2418
## [26,] 228 2013
# Sum up the chunks
colSums(counts)
##     0     1 
##  7395 62605
msa_map <- c("rural", "urban")
# Define the function to apply to each chunk
make_msa_table <- function(chunk) {
    # Read each chunk as a data frame
    x <- dstrsplit(chunk, col_types = rep("integer", length(col_names)), sep = ",")
    # Set the column names of the data frame that's been read
    colnames(x) <- col_names
    # Create new column, msa_pretty, with a string description of where the borrower lives
    x$msa_pretty <- msa_map[x$msa + 1]
    # Create a table from the msa_pretty column
    table(x$msa_pretty)
}

# Create a file connection to mortgage-sample.csv
fc <- file("./RInputFiles/mortgage-sample.csv", "rb")

# Read the first line to get rid of the header
readLines(fc, n = 1)
## [1] "\"enterprise\",\"record_number\",\"msa\",\"perc_minority\",\"tract_income_ratio\",\"borrower_income_ratio\",\"loan_purpose\",\"federal_guarantee\",\"borrower_race\",\"co_borrower_race\",\"borrower_gender\",\"co_borrower_gender\",\"num_units\",\"affordability\",\"year\",\"type\""
# Read the data in chunks
counts <- chunk.apply(fc, make_msa_table, CH.MAX.SIZE = 1e5)

# Close the file connection
close(fc)

# Aggregate the counts as before
colSums(counts)
## rural urban 
##  7395 62605
iotools_read_fun <- function(parallel) {
    fc <- file("./RInputFiles/mortgage-sample.csv", "rb")
    readLines(fc, n = 1)
    chunk.apply(fc, make_msa_table,
                CH.MAX.SIZE = 1e5, parallel = parallel)
    close(fc)
}

# Benchmark the new function
microbenchmark(
    # Use one process
    iotools_read_fun(1), 
    # Use three processes
    iotools_read_fun(3), 
    times = 20
)
## Unit: milliseconds
##                 expr      min       lq     mean   median       uq      max
##  iotools_read_fun(1) 100.7646 103.2403 113.1136 108.0299 116.8064 156.1552
##  iotools_read_fun(3) 102.8915 107.5720 118.4578 110.3017 113.5138 243.5000
##  neval cld
##     20   a
##     20   a

Chapter 4 - Case Study: Preliminary Analysis of Housing Data

Overview of types of analysis for this chapter:

  • Compare proportions of people receiving mortgages
  • Amount of “missingness” in the data
  • Changes in 1) mortgage demographic proportions over time, 2) city vs. rural mortgages, and 3) proportions of federally guaranteed loans

Are the data missing at random?

  • Missing data is pervasive, including in this housing dataset
  • Three categories of missing data
    • Missing Completely at Random (MCAR) - no way to predict where/what, meaning rows with missing data can just be dropped
    • Missing at Random (MAR) - missingness is dependent on variables in the dataset, meaning that multiple imputation can be successful
    • Missing Not At Random (MNAR) - typically due to deterministic relationships between missing data and other variables, beyond the scope of this course
  • Assumption for this exercise will be that data are checked for MAR and assumed to be MCAR if they are not MAR
    • For each column, recode the column as a 1/0 for missing, then run a logit on all the other variables
    • If the other variables have a statistically significant prediction effect on the 1/0 column, then that column is MAR rather than MCAR
    • Need a suitably strict p-value threshold (e.g., a multiple-comparisons correction) given the number of regressions that have been run
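The stricter significance threshold mentioned above is a multiple-comparisons correction; a minimal sketch with base R's p.adjust() (these p-values are invented for illustration):

```r
# Hypothetical p-values, one per missingness regression
p_values <- c(0.001, 0.012, 0.030, 0.200, 0.640)

# Bonferroni correction: multiply each p-value by the number of tests (capped at 1)
p_bonf <- p.adjust(p_values, method = "bonferroni")

# Only columns still significant after correction would be treated as MAR
which(p_bonf < 0.05)   # 1
```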

Analyzing the Housing Data:

  • Adjusted counts - adjusting group sizes allows you to compare different groups as though they were the same size
  • Proportional change can show growth (or decline) of groups over time
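A toy illustration of adjusted counts (numbers invented here): dividing each group's raw count by its population share puts groups on a per-capita footing.

```r
# Raw counts and each group's share of the overall population
counts <- c(A = 90, B = 10)
pop_share <- c(A = 0.9, B = 0.1)

# Adjusted counts: both groups turn out to be equally represented per capita
adjusted <- counts / pop_share
adjusted   # A = 100, B = 100
```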

Borrower Lending Trends: City vs. Rural:

  • Looking at city (MSA == 1) vs rural
  • Looking at federally guaranteed loans
    • Can use Borrower Income Ratio (borrower income divided by median income in the area)

Wrap up:

  • Split-Compute-Combine (aka Split-Apply-Combine) as enabled by bigmemory and iotools
  • Operations can be run on a single machine in series, a single machine in parallel, or across multiple machines
  • Summary of the bigmemory approach
    • Good for dense, large matrices that might otherwise overwhelm RAM
    • Looks like a regular R matrix
  • Summary of the iotools approach:
    • Good for much larger data that can be processed in sequential chunks
    • More flexible than bigmemory in that it can handle data frames and files saved on multiple disks

Example code includes:

# Create a table of borrower_race column
race_table <- bigtable(mort, "borrower_race")

# Rename the elements
names(race_table) <- race_cat[as.numeric(names(race_table))]

# Find the proportion
race_table[1:7] / sum(race_table[1:7])
##   Native Am       Asian       Black  Pacific Is       White Two or More 
## 0.002330129 0.072315464 0.032915105 0.003177448 0.814828092 0.008603552 
##    Hispanic 
## 0.065830210
mort_names <- col_names

# Create table of the borrower_race 
race_table_chunks <- chunk.apply(
    "./RInputFiles/mortgage-sample.csv", function(chunk) { 
        x <- mstrsplit(chunk, sep = ",", type = "integer") 
        colnames(x) <- mort_names 
        table(x[, "borrower_race"])
}, CH.MAX.SIZE = 1e5)

# Add up the columns
race_table <- colSums(race_table_chunks)

# Find the proportion
borrower_proportion <- race_table[1:7] / sum(race_table[1:7])

pop_proportion <- c(0.009, 0.048, 0.126, 0.002, 0.724, 0.029, 0.163)
names(pop_proportion) <- race_cat[1:7]
# Create the matrix
matrix(c(pop_proportion, borrower_proportion), byrow = TRUE, nrow = 2,
  dimnames = list(c("Population Proportion", "Borrower Proportion"), race_cat[1:7]))
##                         Native Am      Asian      Black  Pacific Is
## Population Proportion 0.009000000 0.04800000 0.12600000 0.002000000
## Borrower Proportion   0.002330129 0.07231546 0.03291511 0.003177448
##                           White Two or More   Hispanic
## Population Proportion 0.7240000 0.029000000 0.16300000
## Borrower Proportion   0.8148281 0.008603552 0.06583021
# Create a variable indicating if borrower_race is missing in the mortgage data
borrower_race_ind <- mort[, "borrower_race"] == 9

# Create a factor variable indicating the affordability
affordability_factor <- factor(mort[, "affordability"])

# Perform a logistic regression
summary(glm(borrower_race_ind ~ affordability_factor, family = binomial))
## Warning: closing unused connection 5 (./RInputFiles/mortgage-sample.csv)
## 
## Call:
## glm(formula = borrower_race_ind ~ affordability_factor, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5969  -0.5016  -0.5016  -0.5016   2.0867  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.7478     0.1376 -12.701   <2e-16 ***
## affordability_factor1  -0.2241     0.1536  -1.459   0.1447    
## affordability_factor2  -0.3090     0.1609  -1.920   0.0548 .  
## affordability_factor3  -0.2094     0.1446  -1.448   0.1476    
## affordability_factor4  -0.2619     0.1383  -1.894   0.0582 .  
## affordability_factor9   0.1131     0.1413   0.800   0.4235    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 52279  on 69999  degrees of freedom
## Residual deviance: 52166  on 69994  degrees of freedom
## AIC: 52178
## 
## Number of Fisher Scoring iterations: 4
# Open a connection to the file and skip the header
fc <- file("./RInputFiles/mortgage-sample.csv", "rb")
readLines(fc, n = 1)
## [1] "\"enterprise\",\"record_number\",\"msa\",\"perc_minority\",\"tract_income_ratio\",\"borrower_income_ratio\",\"loan_purpose\",\"federal_guarantee\",\"borrower_race\",\"co_borrower_race\",\"borrower_gender\",\"co_borrower_gender\",\"num_units\",\"affordability\",\"year\",\"type\""
# Create a function to read chunks
make_table <- function(chunk) {
    # Create a matrix
    m <- mstrsplit(chunk, sep = ",", type = "integer")
    colnames(m) <- mort_names
    # Create the output table
    bigtable(m, c("borrower_race", "year"))
}

# Import data using chunk.apply
race_year_table <- chunk.apply(fc, make_table)

# Close connection
close(fc)

# Cast it to a data frame
rydf <- as.data.frame(race_year_table)

# Create a new column Race with race/ethnicity
rydf$Race <- race_cat


# Note: We removed the row corresponding to "Not Avail".
# View rydf
rydf <- 
    rydf %>% 
    filter(Race !="Not Avail")
rydf 
##   2008 2009 2010 2011 2012 2013 2014 2015        Race
## 1   11   18   13   16   15   12   29   29   Native Am
## 2  384  583  603  568  770  673  369  488       Asian
## 3  363  320  209  204  258  312  185  169       Black
## 4   33   38   21   13   28   22   17   23  Pacific Is
## 5 5552 7739 6301 5746 8192 7535 4110 4831       White
## 6   43   85   65   58   89   78   46   64 Two or More
## 7  577  563  384  378  574  613  439  512    Hispanic
# View pop_proportion
pop_proportion
##   Native Am       Asian       Black  Pacific Is       White Two or More 
##       0.009       0.048       0.126       0.002       0.724       0.029 
##    Hispanic 
##       0.163
# Gather on all variables except Race
rydfl <- gather(rydf, Year, Count, -Race)

# Create a new adjusted count variable
rydfl$Adjusted_Count <- rydfl$Count / pop_proportion[rydfl$Race]

# Plot
ggplot(rydfl, aes(x = Year, y = Adjusted_Count, group = Race, color = Race)) + 
    geom_line()

# View rydf
rydf
##   2008 2009 2010 2011 2012 2013 2014 2015        Race
## 1   11   18   13   16   15   12   29   29   Native Am
## 2  384  583  603  568  770  673  369  488       Asian
## 3  363  320  209  204  258  312  185  169       Black
## 4   33   38   21   13   28   22   17   23  Pacific Is
## 5 5552 7739 6301 5746 8192 7535 4110 4831       White
## 6   43   85   65   58   89   78   46   64 Two or More
## 7  577  563  384  378  574  613  439  512    Hispanic
# Normalize the columns
for (i in seq_len(nrow(rydf))) {
  rydf[i, 1:8] <- rydf[i, 1:8] / rydf[i, 1]
}

# Convert the data to long format
rydf_long <- gather(rydf, Year, Proportion, -Race)

# Plot
ggplot(rydf_long, aes(x = Year, y = Proportion, group = Race, color = Race)) + 
    geom_line()

# Open a connection to the file and skip the header
fc <- file("./RInputFiles/mortgage-sample.csv", "rb")
readLines(fc, n = 1)
## [1] "\"enterprise\",\"record_number\",\"msa\",\"perc_minority\",\"tract_income_ratio\",\"borrower_income_ratio\",\"loan_purpose\",\"federal_guarantee\",\"borrower_race\",\"co_borrower_race\",\"borrower_gender\",\"co_borrower_gender\",\"num_units\",\"affordability\",\"year\",\"type\""
# Create a function to read chunks
make_table <- function(chunk) {
    # Create a matrix
    m <- mstrsplit(chunk, sep = ",", type = "integer")
    colnames(m) <- mort_names
    # Create the output table
    bigtable(m, c("msa", "year"))
}

# Import data using chunk.apply
msa_year_table <- chunk.apply(fc, make_table)

# Close connection
close(fc)

# Convert to a data frame
df_msa <- as.data.frame(msa_year_table)

# Add a descriptive MSA column
df_msa$MSA <- c("rural", "city")

# Gather on all columns except Year
df_msa_long <- gather(df_msa, Year, Count, -MSA)

# Plot 
ggplot(df_msa_long, aes(x = Year, y = Count, group = MSA, color = MSA)) + 
    geom_line()

# Tabulate borrower_income_ratio and federal_guarantee
ir_by_fg <- bigtable(mort, c("borrower_income_ratio", "federal_guarantee"))

# Label the columns and rows of the table
income_cat <- c('0 <= 50', '50 < 80', '> 80', 'Not Applicable')
guarantee_cat <- c('FHA/VA', 'RHS', 'HECM', 'No Guarantee')
dimnames(ir_by_fg) <- list(income_cat, guarantee_cat)

# For each row in ir_by_fg, divide by the sum of the row
for (i in seq_len(nrow(ir_by_fg))) {
  ir_by_fg[i, ] = ir_by_fg[i, ] / sum(ir_by_fg[i, ])
}

# Print
ir_by_fg
##                     FHA/VA          RHS         HECM No Guarantee
## 0 <= 50        0.008944544 0.0014636526 0.0443974630    0.9451943
## 50 < 80        0.005977548 0.0024055985 0.0026971862    0.9889197
## > 80           0.001113022 0.0002428412 0.0006475766    0.9979966
## Not Applicable 0.023676880 0.0013927577 0.0487465181    0.9261838
# Quirky fix so that the files can be used again later
rm(mort)
rm(x)
rm(first_three)
rm(first_three_2)
gc()
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  1884083 100.7    3205452 171.2  3205452 171.2
## Vcells 12883431  98.3   20709399 158.1 17929130 136.8

Working with Web Data in R

Chapter 1 - Downloading Files and Using API Clients

Introduction: Working with Web Data in R:

  • Methods for getting data from the internet into R - frequently automatic, such as giving an internet address to read.csv()
  • Using the httr package (tidyverse) to query APIs using GET() and POST()
  • Using JSON and XML formats (nested data structures)
  • CSS (cascading style sheets) for extracts
  • Can use download.file() so that there is no need for repeatedly querying the same remote files
  • You could use write.table(), but then you have to worry about accidentally writing out data in a format R can’t read back in
    • An easy way to avoid this risk is to use saveRDS() and readRDS(), which save R objects in an R-specific file format, with the data structure intact
    • That means you can use it for any type of R object (even ones that don’t turn into tables easily), and not worry you’ll lose data reading it back in
    • saveRDS() takes two arguments, object, pointing to the R object to save and file pointing to where to save it to
    • readRDS() expects file, referring to the path to the RDS file to read in
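A minimal round trip through an RDS file, using a temporary path:

```r
# Any R object survives the round trip, structure intact
obj <- list(weights = c(179, 160, 136), feed = factor("horsebean"))

path <- tempfile(fileext = ".RDS")
saveRDS(obj, file = path)
restored <- readRDS(path)

identical(obj, restored)   # TRUE
```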

Understanding Application Programming Interfaces (API) - automatically handling data changes:

  • Data are frequently made available by way of API
    • “websites, but for machines”, allowing you to query/download data automatically
  • R has several API interaction capabilities
    • Native interfaces to API
    • Hides API complexity
    • Allows for reading data as R object
  • Can find R packages for API by googling CRAN - packages frequently exist already
    • Example is library(pageviews) to get pageview counts

Access tokens and API:

  • API clients (by way of R packages) abstract away the complications of getting the data
  • The API owner frequently does care how your API client interacts with it, though
    • Overwhelming API causes problems for owner and many users
    • Access tokens are sometimes used to monitor and throttle usage
  • Getting access tokens is frequently straightforward
    • Usually requires registering an e-mail address
    • Sometimes requires an explanation
    • Example is www.wordnik.com, which can be accessed by way of library(birdnik)

Example code includes:

# Here are the URLs! As you can see they're just normal strings
csv_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1561/datasets/chickwts.csv"
tsv_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_3026/datasets/tsv_data.tsv"

# Read a file in from the CSV URL and assign it to csv_data
csv_data <- read.csv(csv_url)

# Read a file in from the TSV URL and assign it to tsv_data
tsv_data <- read.delim(tsv_url)

# Examine the objects with head()
head(csv_data)
##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean
head(tsv_data)
##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean
# Download the file with download.file()
download.file(url = csv_url, destfile = "./RInputFiles/feed_data.csv")

# Read it in with read.csv()
csv_data <- read.csv("./RInputFiles/feed_data.csv")


# Add a new column: square_weight
csv_data$square_weight <- csv_data$weight ** 2

# Save it to disk with saveRDS()
saveRDS(csv_data, "./RInputFiles/modified_feed_data.RDS")

# Read it back in with readRDS()
modified_feed_data <- readRDS("./RInputFiles/modified_feed_data.RDS")

# Examine modified_feed_data
str(modified_feed_data)
## 'data.frame':    71 obs. of  3 variables:
##  $ weight       : int  179 160 136 227 217 168 108 124 143 140 ...
##  $ feed         : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ square_weight: num  32041 25600 18496 51529 47089 ...
# Load pageviews
# library(pageviews)

# Get the pageviews for "Hadley Wickham"
hadley_pageviews <- pageviews::article_pageviews(project = "en.wikipedia", "Hadley Wickham")

# Examine the resulting object
str(hadley_pageviews)
## 'data.frame':    1 obs. of  8 variables:
##  $ project    : chr "wikipedia"
##  $ language   : chr "en"
##  $ article    : chr "Hadley_Wickham"
##  $ access     : chr "all-access"
##  $ agent      : chr "all-agents"
##  $ granularity: chr "daily"
##  $ date       : POSIXct, format: "2015-10-01"
##  $ views      : num 53
# Load birdnik
# library(birdnik)

# Get the word frequency for "vector", using api_key to access it
# vector_frequency <- word_frequency(api_key, "vector")

Chapter 2 - Using httr to Interact with APIs Directly

GET and POST requests in theory - https and web requests in theory:

  • Interactions on the internet can be thought of as client-server communication
  • The most common request is “GET”, which is the client request for something from the server
    • The parallel is “POST”, which is asking the server to accept something from the client
    • HEAD is like GET but asks for only the headers (metadata) of a resource - loosely analogous to previewing data with head()
    • DELETE is a request to the server to get rid of something - typically not needed
  • The httr package enables basic communication in R
    • response <- httr::GET(url=) # will get that url
    • httr::content(response) # will extract the content (body) of the response
    • response <- httr::POST(url=) is for posting, and the recipient can figure out what to do with the data

Graceful httr - code that responds appropriately and constructs its own url:

  • Error handling - every httr response comes back with a status code
    • Status: 200 (completed) - codes starting with 2 or 3 are usually fine
    • Status: 404 (no clue where to look) - codes starting with 4 usually mean an error in your request
    • Status: codes starting with 5 usually mean an error on the server side
    • Can check for bad codes with http_error()
  • URL construction - frequently most of the url stays the same, with only a few components changing between requests
    • Directory-based urls separate components with / and can be created using paste(sep = “/”) - very common, and very easy to create
    • Parameter based url use text like https://fakeurl.com/api.php?a=1&b=2 and can be created using GET() with its named list of parameters
    • GET(“fakeurl.com/api.php”, query = list(fruit = “peaches”, day = “thursday”))
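Both styles can be built ahead of time without making a request - a minimal sketch using httr's modify_url() (the URLs here are fake, as in the notes above):

```r
library(httr)

# Directory-based: components joined with "/"
directory_url <- paste("http://swapi.co/api", "people", 1, sep = "/")
directory_url

# Parameter-based: httr builds the query string from a named list,
# exactly as GET(url, query = ...) would
parameter_url <- modify_url("https://fakeurl.com/api.php",
  query = list(fruit = "peaches", day = "thursday"))
parameter_url
```

Passing the same named list to GET(url, query = query_params) sends a request to the identical URL.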

Respectful API Usage - usage that works for the API owners as well as the clients:

  • User agents - bits of text that identify your client and give the server some idea of what you are trying to do; can be set with user_agent(), and it is polite to include an e-mail address so the API owners can contact you
  • Many APIs enforce rate limits - exceed them and you will be blocked
    • Keep an interval between requests, for example by sleeping between them with Sys.sleep()

Example code includes:

# Load the httr package
library(httr)

# Make a GET request to http://httpbin.org/get
get_result <- GET("http://httpbin.org/get")

# Print it to inspect it
# get_result


# Make a POST request to http://httpbin.org/post with the body "this is a test"
# post_result <- POST(url="http://httpbin.org/post", body="this is a test")

# Print it to inspect it
# post_result


url <- "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/Hadley_Wickham/daily/20170101/20170102"
# Make a GET request to url and save the results
pageview_response <- GET(url)

# Call content() to retrieve the data the server sent back
pageview_data <- content(pageview_response)

# Examine the results with str()
str(pageview_data)
## List of 1
##  $ items:List of 2
##   ..$ :List of 7
##   .. ..$ project    : chr "en.wikipedia"
##   .. ..$ article    : chr "Hadley_Wickham"
##   .. ..$ granularity: chr "daily"
##   .. ..$ timestamp  : chr "2017010100"
##   .. ..$ access     : chr "all-access"
##   .. ..$ agent      : chr "all-agents"
##   .. ..$ views      : int 45
##   ..$ :List of 7
##   .. ..$ project    : chr "en.wikipedia"
##   .. ..$ article    : chr "Hadley_Wickham"
##   .. ..$ granularity: chr "daily"
##   .. ..$ timestamp  : chr "2017010200"
##   .. ..$ access     : chr "all-access"
##   .. ..$ agent      : chr "all-agents"
##   .. ..$ views      : int 86
fake_url <- "http://google.com/fakepagethatdoesnotexist"

# Make the GET request
request_result <- GET(fake_url)

# Check request_result
if(http_error(request_result)){
    warning("The request failed")
} else {
    content(request_result)
}
## Warning: The request failed
# Construct a directory-based API URL to `http://swapi.co/api`,
# looking for person `1` in `people`
directory_url <- paste("http://swapi.co/api", "people", 1, sep = "/")

# Make a GET call with it
result <- GET(directory_url)


# Create list with nationality and country elements
query_params <- list(nationality = "americans", 
    country = "antigua")
    
# Make parameter-based call to httpbin, with query_params
parameter_response <- GET("https://httpbin.org/get", query = query_params)

# Print parameter_response
parameter_response
## Response [https://httpbin.org/get?nationality=americans&country=antigua]
##   Date: 2018-02-16 13:02
##   Status: 200
##   Content-Type: application/json
##   Size: 425 B
## {
##   "args": {
##     "country": "antigua", 
##     "nationality": "americans"
##   }, 
##   "headers": {
##     "Accept": "application/json, text/xml, application/xml, */*", 
##     "Accept-Encoding": "gzip, deflate", 
##     "Connection": "close", 
##     "Host": "httpbin.org", 
## ...
# Do not change the url
# url <- "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Aaron_Halfaker/daily/2015100100/2015103100"

# Add the email address and the test sentence inside user_agent()
# server_response <- GET(url, user_agent("my@email.address this is a test"))


# Construct a vector of 2 URLs
urls <- c("http://fakeurl.com/api/1.0/", "http://fakeurl.com/api/2.0/")

for(url in urls){
    # Send a GET request to url
    result <- GET(url)
    # Delay for 5 seconds between requests
    Sys.sleep(5)
}


get_pageviews <- function(article_title){
    
    url <- paste("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents", article_title, "daily/2015100100/2015103100", sep = "/")
    
    response <- GET(url, user_agent("my@email.com this is a test")) 
    
    if(http_error(response)){ 
        stop("the request failed" ) 
    } else { 
        result <- content(response) 
        return(result) 
    }
}

Chapter 3 - Handling JSON and XML

JSON is a dictionary-like format (plain text) for sending data on the internet:

  • All JSON structures are made up of objects (name-value pairs in curly braces) {“a” : “b” , “c” : “d”} and arrays [1977, 1980]
    • Values can be “string”, number, true, false, null, another object or array
    • Complicated hierarchy can easily be represented
  • Can find the type of data using httr::http_type(response)
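A minimal sketch of how JSON maps into R, assuming the jsonlite package is installed (the JSON string and its values are invented for illustration):

```r
library(jsonlite)

# An object with a string, a number, and an array
movie_json <- '{"title" : "A New Hope", "year" : 1977,
                "stars" : ["Mark Hamill", "Carrie Fisher"]}'

# Objects parse to named lists; arrays simplify to vectors by default
movie <- fromJSON(movie_json)
str(movie)

# Disable simplification to see the raw list-of-lists structure
movie_raw <- fromJSON(movie_json, simplifyVector = FALSE)
str(movie_raw)
```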

Manipulating JSON - nested lists are the natural R representation of JSON:

  • fromJSON() will return named lists (if key-value pairs) and unnamed lists (if arrays)
    • The simplifyDataFrame = TRUE argument will pull everything together into a data frame if possible
    • Alternately, can run lapply (or similar) over the list that has been returned
  • One way to extract relevant data from that list is to use a package specifically designed for manipulating lists, rlist
    • rlist provides two particularly useful functions for selecting and combining elements from a list: list.select() and list.stack()
    • list.select() extracts sub-elements by name from each element in a list
    • For example using the parsed movies data from the video (movies_list), we might ask for the title and year elements from each element: list.select(movies_list, title, year)
    • The result is still a list; that is where list.stack() comes in. It will stack the elements of a list into a data frame: list.stack(list.select(movies_list, title, year))
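A minimal sketch of the rlist workflow (movies_list here is a small invented stand-in for the parsed movies data from the video):

```r
library(rlist)

# A list of lists, as fromJSON() would return for an array of objects
movies_list <- list(
  list(title = "A New Hope", year = 1977, director = "George Lucas"),
  list(title = "The Empire Strikes Back", year = 1980, director = "Irvin Kershner")
)

# list.select() keeps only the named sub-elements of each list element
title_year <- list.select(movies_list, title, year)

# list.stack() stacks those elements into a data frame
movies_df <- list.stack(title_year)
movies_df
```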

XML Structure - plain text like JSON, but with a very different structure:

  • Consists of markup (tags) and content (data)
    • Tags begin with < and end with >
    • Content is typically wrapped in a pair of tags, such as <tag>some stuff</tag>
    • Can provide attributes inside of tags, such as <tag attribute="value">more stuff</tag>
    • There is no formal standard for what goes in attributes, though they are usually used only for metadata
  • XML is a hierarchical structure, and includes everything between the start tag and the end tag
    • Each element can contain many other elements
    • Sub-elements are considered to be “children” of the “parent” element they are part of; “children” of the same “parent” are called “sibling” tags
  • Just like JSON, you should first verify the response is indeed XML with http_type() and by examining the result of content(r, as = “text”)
    • Then you can turn the response into an XML document object with read_xml()
    • One benefit of using the XML document object is the available functions that help you explore and manipulate the document
    • For example xml_structure() will print a representation of the XML document that emphasizes the hierarchical structure by displaying the elements without the data

XPATH - language for specifying nodes in an XML document:

  • XPATH looks a lot like file.path, since it uses forward slash / to find the requested sub-nodes
  • xml_find_all(x=, xpath=) # x is the object such as movies_xml and path is the xpath such as “/movies/movie/title”; will return a “node set”
    • xml_text() run on a “node set” will return the data in an easier to digest format
    • The // means “any node at any level below”, so “//title” will grab any node, from any path, that is tagged as “title”
    • The @ means to extract an attribute; so, //movie/@episode will create a node set of the episodes under the movie tags
  • Alternate ways to extract attributes include xml_attr() and xml_attrs()
    • xml_attrs() takes a nodeset and returns all of the attributes for every node in the nodeset
    • xml_attr() takes a nodeset and an additional argument attr to extract a single named attribute from each node in the nodeset
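These XPATH pieces can be tried on a small inline document, assuming the xml2 package (the XML snippet is invented for illustration):

```r
library(xml2)

movies_xml <- read_xml(
  '<movies>
     <movie episode="4"><title>A New Hope</title></movie>
     <movie episode="5"><title>The Empire Strikes Back</title></movie>
   </movies>')

# Absolute path to the title nodes returns a node set
title_nodes <- xml_find_all(movies_xml, "/movies/movie/title")

# xml_text() pulls the data out of the node set
xml_text(title_nodes)

# // matches nodes at any depth; xml_attr() extracts a named attribute
movie_nodes <- xml_find_all(movies_xml, "//movie")
xml_attr(movie_nodes, "episode")
```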

Example code includes:

rev_history <- function(title, format = "json"){
  if (title != "Hadley Wickham") {
    stop('rev_history() only works for `title = "Hadley Wickham"`')
  }
  
  if (format == "json"){
    resp <- readRDS("had_rev_json.rds")
  } else if (format == "xml"){
    resp <- readRDS("had_rev_xml.rds")
  } else {
    stop('Invalid format supplied, try "json" or "xml"')
  }
  resp  
}

test_json <- "{\"continue\":{\"rvcontinue\":\"20150528042700|664370232\",\"continue\":\"||\"},\"query\":{\"pages\":{\"41916270\":{\"pageid\":41916270,\"ns\":0,\"title\":\"Hadley Wickham\",\"revisions\":[{\"user\":\"214.28.226.251\",\"anon\":\"\",\"timestamp\":\"2015-01-14T17:12:45Z\",\"comment\":\"\",\"contentformat\":\"text/x-wiki\",\"contentmodel\":\"wikitext\",\"*\":\"'''Hadley Mary Helen Wickham III''' is a  [[statistician]] from [[New Zealand]] who is currently Chief Scientist at [[RStudio]]<ref>{{cite web|url=http://washstat.org/wss1310.shtml |title=Washington Statistical Society October 2013 Newsletter |publisher=Washstat.org |date= |accessdate=2014-02-12}}</ref><ref>{{cite web|url=http://news.idg.no/cw/art.cfm?id=F66B12BB-D13E-94B0-DAA22F5AB01BEFE7 |title=60+ R resources to improve your data skills ( - Software ) |publisher=News.idg.no |date= |accessdate=2014-02-12}}</ref> and an [[Professors_in_the_United_States#Adjunct_professor|adjunct]] [[Assistant Professor]] of statistics at [[Rice University]].<ref name=\\\"about\\\">{{cite web|url=http://www.rstudio.com/about/ |title=About - RStudio |accessdate=2014-08-13}}</ref> He is best known for his development of open-source statistical analysis software packages for [[R (programming language)]] that implement logics of [[data visualisation]] and data transformation. 
Wickham completed his undergraduate studies at the [[University of Auckland]] and his PhD at [[Iowa State University]] under the supervision of Di Cook and Heike Hoffman.<ref>{{cite web|URL=http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html |title= The R-Files: Hadley Wickham}}</ref> In 2006 he was awarded the [[John_Chambers_(statistician)|John Chambers]] Award for Statistical Computing for his work developing tools for data reshaping and visualisation.<ref>{{cite web|url=http://stat-computing.org/awards/jmc/winners.html |title=John Chambers Award Past winners|publisher=ASA Sections on Statistical Computing, Statistical Graphics,|date= |accessdate=2014-08-12}}</ref>\\n\\nHe is a prominent and active member of the [[R (programming language)|R]] user community and has developed several notable and widely used packages including [[ggplot2]], plyr, dplyr, and reshape2.<ref name=\\\"about\\\" /><ref>{{cite web|url=http://www.r-statistics.com/2013/06/top-100-r-packages-for-2013-jan-may/ |title=Top 100 R Packages for 2013 (Jan-May)! 
|publisher=R-statistics blog |date= |accessdate=2014-08-12}}</ref>\"},{\"user\":\"73.183.151.193\",\"anon\":\"\",\"timestamp\":\"2015-01-15T15:49:34Z\",\"comment\":\"\",\"contentformat\":\"text/x-wiki\",\"contentmodel\":\"wikitext\",\"*\":\"'''Hadley Wickham''' is a  [[statistician]] from [[New Zealand]] who is currently Chief Scientist at [[RStudio]]<ref>{{cite web|url=http://washstat.org/wss1310.shtml |title=Washington Statistical Society October 2013 Newsletter |publisher=Washstat.org |date= |accessdate=2014-02-12}}</ref><ref>{{cite web|url=http://news.idg.no/cw/art.cfm?id=F66B12BB-D13E-94B0-DAA22F5AB01BEFE7 |title=60+ R resources to improve your data skills ( - Software ) |publisher=News.idg.no |date= |accessdate=2014-02-12}}</ref> and an [[Professors_in_the_United_States#Adjunct_professor|adjunct]] [[Assistant Professor]] of statistics at [[Rice University]].<ref name=\\\"about\\\">{{cite web|url=http://www.rstudio.com/about/ |title=About - RStudio |accessdate=2014-08-13}}</ref> He is best known for his development of open-source statistical analysis software packages for [[R (programming language)]] that implement logics of [[data visualisation]] and data transformation. 
Wickham completed his undergraduate studies at the [[University of Auckland]] and his PhD at [[Iowa State University]] under the supervision of Di Cook and Heike Hoffman.<ref>{{cite web|URL=http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html |title= The R-Files: Hadley Wickham}}</ref> In 2006 he was awarded the [[John_Chambers_(statistician)|John Chambers]] Award for Statistical Computing for his work developing tools for data reshaping and visualisation.<ref>{{cite web|url=http://stat-computing.org/awards/jmc/winners.html |title=John Chambers Award Past winners|publisher=ASA Sections on Statistical Computing, Statistical Graphics,|date= |accessdate=2014-08-12}}</ref>\\n\\nHe is a prominent and active member of the [[R (programming language)|R]] user community and has developed several notable and widely used packages including [[ggplot2]], plyr, dplyr, and reshape2.<ref name=\\\"about\\\" /><ref>{{cite web|url=http://www.r-statistics.com/2013/06/top-100-r-packages-for-2013-jan-may/ |title=Top 100 R Packages for 2013 (Jan-May)! 
|publisher=R-statistics blog |date= |accessdate=2014-08-12}}</ref>\"},{\"user\":\"FeanorStar7\",\"timestamp\":\"2015-01-24T16:34:31Z\",\"comment\":\"/* External links */ add LCCN and cats\",\"contentformat\":\"text/x-wiki\",\"contentmodel\":\"wikitext\",\"*\":\"'''Hadley Wickham''' is a  [[statistician]] from [[New Zealand]] who is currently Chief Scientist at [[RStudio]]<ref>{{cite web|url=http://washstat.org/wss1310.shtml |title=Washington Statistical Society October 2013 Newsletter |publisher=Washstat.org |date= |accessdate=2014-02-12}}</ref><ref>{{cite web|url=http://news.idg.no/cw/art.cfm?id=F66B12BB-D13E-94B0-DAA22F5AB01BEFE7 |title=60+ R resources to improve your data skills ( - Software ) |publisher=News.idg.no |date= |accessdate=2014-02-12}}</ref> and an [[Professors_in_the_United_States#Adjunct_professor|adjunct]] [[Assistant Professor]] of statistics at [[Rice University]].<ref name=\\\"about\\\">{{cite web|url=http://www.rstudio.com/about/ |title=About - RStudio |accessdate=2014-08-13}}</ref> He is best known for his development of open-source statistical analysis software packages for [[R (programming language)]] that implement logics of [[data visualisation]] and data transformation. 
Wickham completed his undergraduate studies at the [[University of Auckland]] and his PhD at [[Iowa State University]] under the supervision of Di Cook and Heike Hoffman.<ref>{{cite web|URL=http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html |title= The R-Files: Hadley Wickham}}</ref> In 2006 he was awarded the [[John_Chambers_(statistician)|John Chambers]] Award for Statistical Computing for his work developing tools for data reshaping and visualisation.<ref>{{cite web|url=http://stat-computing.org/awards/jmc/winners.html |title=John Chambers Award Past winners|publisher=ASA Sections on Statistical Computing, Statistical Graphics,|date= |accessdate=2014-08-12}}</ref>\\n\\nHe is a prominent and active member of the [[R (programming language)|R]] user community and has developed several notable and widely used packages including [[ggplot2]], plyr, dplyr, and reshape2.<ref name=\\\"about\\\" /><ref>{{cite web|url=http://www.r-statistics.com/2013/06/top-100-r-packages-for-2013-jan-may/ |title=Top 100 R Packages for 2013 (Jan-May)! 
|publisher=R-statistics blog |date= |accessdate=2014-08-12}}</ref>\"},{\"user\":\"KasparBot\",\"timestamp\":\"2015-04-26T19:18:17Z\",\"comment\":\"authority control moved to wikidata\",\"contentformat\":\"text/x-wiki\",\"contentmodel\":\"wikitext\",\"*\":\"'''Hadley Wickham''' is a  [[statistician]] from [[New Zealand]] who is currently Chief Scientist at [[RStudio]]<ref>{{cite web|url=http://washstat.org/wss1310.shtml |title=Washington Statistical Society October 2013 Newsletter |publisher=Washstat.org |date= |accessdate=2014-02-12}}</ref><ref>{{cite web|url=http://news.idg.no/cw/art.cfm?id=F66B12BB-D13E-94B0-DAA22F5AB01BEFE7 |title=60+ R resources to improve your data skills ( - Software ) |publisher=News.idg.no |date= |accessdate=2014-02-12}}</ref> and an [[Professors_in_the_United_States#Adjunct_professor|adjunct]] [[Assistant Professor]] of statistics at [[Rice University]].<ref name=\\\"about\\\">{{cite web|url=http://www.rstudio.com/about/ |title=About - RStudio |accessdate=2014-08-13}}</ref> He is best known for his development of open-source statistical analysis software packages for [[R (programming language)]] that implement logics of [[data visualisation]] and data transformation. 
Wickham completed his undergraduate studies at the [[University of Auckland]] and his PhD at [[Iowa State University]] under the supervision of Di Cook and Heike Hoffman.<ref>{{cite web|URL=http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html |title= The R-Files: Hadley Wickham}}</ref> In 2006 he was awarded the [[John_Chambers_(statistician)|John Chambers]] Award for Statistical Computing for his work developing tools for data reshaping and visualisation.<ref>{{cite web|url=http://stat-computing.org/awards/jmc/winners.html |title=John Chambers Award Past winners|publisher=ASA Sections on Statistical Computing, Statistical Graphics,|date= |accessdate=2014-08-12}}</ref>\\n\\nHe is a prominent and active member of the [[R (programming language)|R]] user community and has developed several notable and widely used packages including [[ggplot2]], plyr, dplyr, and reshape2.<ref name=\\\"about\\\" /><ref>{{cite web|url=http://www.r-statistics.com/2013/06/top-100-r-packages-for-2013-jan-may/ |title=Top 100 R Packages for 2013 (Jan-May)! 
|publisher=R-statistics blog |date= |accessdate=2014-08-12}}</ref>\"},{\"user\":\"Spkal\",\"timestamp\":\"2015-05-06T18:24:57Z\",\"comment\":\"/* Bibliography */  Added his new book, R Packages\",\"contentformat\":\"text/x-wiki\",\"contentmodel\":\"wikitext\",\"*\":\"'''Hadley Wickham''' is a  [[statistician]] from [[New Zealand]] who is currently Chief Scientist at [[RStudio]]<ref>{{cite web|url=http://washstat.org/wss1310.shtml |title=Washington Statistical Society October 2013 Newsletter |publisher=Washstat.org |date= |accessdate=2014-02-12}}</ref><ref>{{cite web|url=http://news.idg.no/cw/art.cfm?id=F66B12BB-D13E-94B0-DAA22F5AB01BEFE7 |title=60+ R resources to improve your data skills ( - Software ) |publisher=News.idg.no |date= |accessdate=2014-02-12}}</ref> and an [[Professors_in_the_United_States#Adjunct_professor|adjunct]] [[Assistant Professor]] of statistics at [[Rice University]].<ref name=\\\"about\\\">{{cite web|url=http://www.rstudio.com/about/ |title=About - RStudio |accessdate=2014-08-13}}</ref> He is best known for his development of open-source statistical analysis software packages for [[R (programming language)]] that implement logics of [[data visualisation]] and data transformation. 
Wickham completed his undergraduate studies at the [[University of Auckland]] and his PhD at [[Iowa State University]] under the supervision of Di Cook and Heike Hoffman.<ref>{{cite web|URL=http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html |title= The R-Files: Hadley Wickham}}</ref> In 2006 he was awarded the [[John_Chambers_(statistician)|John Chambers]] Award for Statistical Computing for his work developing tools for data reshaping and visualisation.<ref>{{cite web|url=http://stat-computing.org/awards/jmc/winners.html |title=John Chambers Award Past winners|publisher=ASA Sections on Statistical Computing, Statistical Graphics,|date= |accessdate=2014-08-12}}</ref>\\n\\nHe is a prominent and active member of the [[R (programming language)|R]] user community and has developed several notable and widely used packages including [[ggplot2]], plyr, dplyr, and reshape2.<ref name=\\\"about\\\" /><ref>{{cite web|url=http://www.r-statistics.com/2013/06/top-100-r-packages-for-2013-jan-may/ |title=Top 100 R Packages for 2013 (Jan-May)! |publisher=R-statistics blog |date= |accessdate=2014-08-12}}</ref>\"}]}}}}"

# Get revision history for "Hadley Wickham"
resp_json <- rev_history("Hadley Wickham")

# Check http_type() of resp_json
http_type(resp_json)

# Examine returned text with content()
content(resp_json, as="text")

# Parse response with content()
content(resp_json, as="parsed")

# Parse returned text with fromJSON()
library(jsonlite)
fromJSON(content(resp_json, as="text"))


# Load rlist
library(rlist)

# Examine output of this code
str(content(resp_json), max.level = 4)

# Store revision list
revs <- content(resp_json)$query$pages$`41916270`$revisions

# Extract the user element
user_time <- list.select(revs, user, timestamp)

# Print user_time
user_time

# Stack to turn into a data frame
list.stack(user_time)


# Load dplyr
library(dplyr)

# Pull out revision list
revs <- content(resp_json)$query$pages$`41916270`$revisions

# Extract user and timestamp
revs %>%
  bind_rows() %>%           
  select(user, timestamp)


# Load xml2
library(xml2)

# Get XML revision history
resp_xml <- rev_history("Hadley Wickham", format = "xml")

# Check response is XML 
http_type(resp_xml)

# Examine returned text with content()
rev_text <- content(resp_xml, as="text")
rev_text

# Turn rev_text into an XML document
rev_xml <- read_xml(rev_text)

# Examine the structure of rev_xml
str(rev_xml)


# Load xml2
library(xml2)

# Get XML revision history
resp_xml <- rev_history("Hadley Wickham", format = "xml")

# Check response is XML 
http_type(resp_xml)

# Examine returned text with content()
rev_text <- content(resp_xml, as="text")
rev_text

# Turn rev_text into an XML document
rev_xml <- read_xml(rev_text)

# Examine the structure of rev_xml
xml_structure(rev_xml)


# Find all nodes using XPATH "/api/query/pages/page/revisions/rev"
xml_find_all(rev_xml, "/api/query/pages/page/revisions/rev")

# Find all rev nodes anywhere in document
rev_nodes <- xml_find_all(rev_xml, "//rev")

# Use xml_text() to get text from rev_nodes
xml_text(rev_nodes)


# All rev nodes
rev_nodes <- xml_find_all(rev_xml, "//rev")

# The first rev node
first_rev_node <- xml_find_first(rev_xml, "//rev")

# Find all attributes with xml_attrs()
xml_attrs(first_rev_node)

# Find user attribute with xml_attr()
xml_attr(first_rev_node, attr="user")

# Find user attribute for all rev nodes
xml_attr(rev_nodes, attr="user")

# Find anon attribute for all rev nodes
xml_attr(rev_nodes, attr="anon")


get_revision_history <- function(article_title){
  # Get raw revision response
  rev_resp <- rev_history(article_title, format = "xml")
  
  # Turn the content() of rev_resp into XML
  rev_xml <- read_xml(content(rev_resp, "text"))
  
  # Find revision nodes
  rev_nodes <- xml_find_all(rev_xml, "//rev")

  # Parse out usernames
  user <- xml_attr(rev_nodes, attr="user")
  
  # Parse out timestamps
  timestamp <- readr::parse_datetime(xml_attr(rev_nodes, "timestamp"))
  
  # Parse out content
  content <- xml_text(rev_nodes)
  
  # Return data frame 
  data.frame(user = user,
    timestamp = timestamp,
    content = substr(content, 1, 40))
}

# Call function for "Hadley Wickham"
get_revision_history(article_title = "Hadley Wickham")

Chapter 4 - Web Scraping with XPATH

Web scraping 101 - sometimes a website does not have an API, so a different approach is required:

  • Web scraping is the process of grabbing the full html of a page and then parsing out the data as needed
  • A “selector” browser plug-in (such as SelectorGadget) can be helpful for finding the selectors associated with elements of interest
  • There is a package “rvest” that helps to simplify the process of web scraping
    • rvest::read_html(url=) # returns an XML document
    • html_node() will extract the node matching an XPATH - the first argument to html_node() should be the XML document returned by the previous step

HTML structure - basically, content within tags, much like XML:

  • For example

    <p>This is a test</p>

    requests that “This is a test” be displayed in paragraph form
  • Attributes can be stored also, such as <p id="test">this is a test</p>
  • Parameters can incorporate formatting, style, and the like
  • The rvest package has the means for extracting the data from html
    • html_text(x=) for text contents
    • html_attr(x=, name=) to get a specific attribute
    • html_name(x=) to get the tag name

Reformatting data (especially to a rectangular format such as a data frame):

  • Turning html tables (tables are a structure in html) into data frames
    • They can be identified in raw html by their <table> tags
    • They can be turned into data frames using html_table()
    • Can assign column names using colnames() as per normal R
  • Turning html non-tables into data frames
    • Use data.frame() with the vectors of text or names or attributes or the like
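A minimal sketch of the non-table case, assuming the rvest package (the HTML fragment and its URLs are invented):

```r
library(rvest)

page <- read_html(
  '<html><body>
     <a href="https://example.org/one">First link</a>
     <a href="https://example.org/two">Second link</a>
   </body></html>')

# Grab all the anchor nodes
links <- html_nodes(page, "a")

# Combine the extracted text and attribute vectors into a data frame
link_df <- data.frame(
  text = html_text(links),
  url  = html_attr(links, "href"))
link_df
```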

Example code includes:

# Load rvest
library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding
# Hadley Wickham's Wikipedia page
test_url <- "https://en.wikipedia.org/wiki/Hadley_Wickham"

# Read the URL stored as "test_url" with read_html()
test_xml <- read_html(test_url)

# Print test_xml
test_xml
## {xml_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...
test_node_xpath <- "//*[contains(concat( \" \", @class, \" \" ), concat( \" \", \"vcard\", \" \" ))]"
# Use html_node() to grab the node with the XPATH stored as `test_node_xpath`
node <- html_node(x = test_xml, xpath = test_node_xpath)

# Print the first element of the result
node[1]
## $node
## <pointer: 0x000000000b91bb80>
# The first thing we'll grab is a name, from the first element of the previously extracted table (now stored as table_element)
table_element <- node

# Extract the name of table_element
element_name <- html_name(table_element)

# Print the name
element_name
## [1] "table"
second_xpath_val <- "//*[contains(concat( \" \", @class, \" \" ), concat( \" \", \"fn\", \" \" ))]"
# Extract the element of table_element referred to by second_xpath_val and store it as page_name
page_name <- html_node(x = table_element, xpath = second_xpath_val)

# Extract the text from page_name
page_title <- html_text(page_name)

# Print page_title
page_title
## [1] "Hadley Wickham"
# Turn table_element into a data frame and assign it to wiki_table
wiki_table <- html_table(table_element)

# Print wiki_table
wiki_table
##       Hadley Wickham
## 1                   
## 2          Residence
## 3         Alma mater
## 4          Known for
## 5             Awards
## 6  Scientific career
## 7             Fields
## 8             Thesis
## 9  Doctoral advisors
## 10 Doctoral students
## 11                  
##                                                                       Hadley Wickham
## 1                                                                                   
## 2                                                                      United States
## 3                                      Iowa State University, University of Auckland
## 4                                                    R programming language packages
## 5  John Chambers Award (2006)\nFellow of the American Statistical Association (2015)
## 6                                                                  Scientific career
## 7                                 Statistics\nData science\nR (programming language)
## 8                               Practical tools for exploring data and models (2008)
## 9                                                             Di Cook\nHeike Hofmann
## 10                                                                 Garrett Grolemund
## 11
# Rename the columns of wiki_table
colnames(wiki_table) <- c("key", "value")

# Remove the empty row from wiki_table
cleaned_table <- subset(wiki_table, !(key == ""))

# Print cleaned_table
cleaned_table
##                  key
## 2          Residence
## 3         Alma mater
## 4          Known for
## 5             Awards
## 6  Scientific career
## 7             Fields
## 8             Thesis
## 9  Doctoral advisors
## 10 Doctoral students
##                                                                                value
## 2                                                                      United States
## 3                                      Iowa State University, University of Auckland
## 4                                                    R programming language packages
## 5  John Chambers Award (2006)\nFellow of the American Statistical Association (2015)
## 6                                                                  Scientific career
## 7                                 Statistics\nData science\nR (programming language)
## 8                               Practical tools for exploring data and models (2008)
## 9                                                             Di Cook\nHeike Hofmann
## 10                                                                 Garrett Grolemund

Chapter 5 - CSS Web Scraping and Final Case Study

CSS (cascading style sheets) web scraping in theory:

  • CSS is for style, formatting, and the like
  • Groups of CSS commands can be associated with a class, allowing the class to be reused in multiple areas
    • .class_a { color: black; }
    • .class_b { color: red; }
    • Specific html can then be addressed using <p class="class_a">This is black</p>
  • CSS scraping is the concept of finding the class groups
    • Works much like XPATH but will often grab many items rather than just a single element
    • It’s more common with CSS selectors to use html_nodes()
    • To select elements with a certain class, you add a . in front of the class name
    • If you need to select an element based on its id, you add a # in front of the id name
    • For example, if this element were inside your HTML document:
    • <h1 class="heading" id="intro">Introduction</h1>
    • You could select it by its class using the CSS selector “.heading”, or by its id using the CSS selector “#intro”
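A minimal sketch of the class and id selectors, assuming the rvest package (the HTML fragment is invented for illustration):

```r
library(rvest)

page <- read_html(
  '<html><body>
     <h1 class="heading" id="intro">Introduction</h1>
     <p class="heading">Another heading-styled element</p>
   </body></html>')

# . selects by class: both elements share class "heading"
heading_nodes <- html_nodes(page, css = ".heading")
heading_nodes

# # selects by id: only the h1 has id "intro"
intro_text <- html_text(html_nodes(page, css = "#intro"))
intro_text
```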

Final case study: Introduction:

  • Extracting an infobox from a Wikipedia page
    1. Get XML by way of API
    2. Extract infobox from the page
    3. Clean up and convert to data frame
    4. Wrap in a function for reproducibility

Wrap up:

  • Downloading and reading flat files
  • Designing and using API clients
  • Web scraping using XPATHs and CSS

Example code includes:

library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding
# Hadley Wickham's Wikipedia page
test_url <- "https://en.wikipedia.org/wiki/Hadley_Wickham"

# Read the URL stored as "test_url" with read_html()
test_xml <- read_html(test_url)

# Print test_xml
test_xml
## {xml_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...
# Select the table elements
html_nodes(test_xml, css = "table")
## {xml_nodeset (2)}
## [1] <table class="infobox biography vcard" style="width:22em">\n<tr>\n<t ...
## [2] <table class="nowraplinks hlist navbox-inner" style="border-spacing: ...
# Select elements with class = "infobox"
html_nodes(test_xml, css = ".infobox")
## {xml_nodeset (1)}
## [1] <table class="infobox biography vcard" style="width:22em">\n<tr>\n<t ...
# Select elements with id = "firstHeading"
html_nodes(test_xml, css = "#firstHeading")
## {xml_nodeset (1)}
## [1] <h1 id="firstHeading" class="firstHeading" lang="en">Hadley Wickham< ...
# Extract element with class infobox
infobox_element <- html_nodes(test_xml, css = ".infobox")

# Get tag name of infobox_element
element_name <- html_name(infobox_element)

# Print element_name
element_name
## [1] "table"
# Extract element with class fn
page_name <- html_node(x = infobox_element, css=".fn")

# Get contents of page_name
page_title <- html_text(page_name)

# Print page_title
page_title
## [1] "Hadley Wickham"
# Load httr
library(httr)

# The API url
base_url <- "https://en.wikipedia.org/w/api.php"

# Set query parameters
query_params <- list(action="parse", 
  page="Hadley Wickham", 
  format="xml")

# Get data from API
resp <- GET(url = "https://en.wikipedia.org/w/api.php", query = query_params)
    
# Parse response
resp_xml <- content(resp)


# Load rvest
library(rvest)

# Read page contents as HTML
page_html <- read_html(xml_text(resp_xml))

# Extract infobox element
infobox_element <- html_node(page_html, css=".infobox")

# Extract page name element from infobox
page_name <- html_node(infobox_element, css=".fn")

# Extract page name as text
page_title <- html_text(page_name)


# Your code from earlier exercises
wiki_table <- html_table(infobox_element)
colnames(wiki_table) <- c("key", "value")
cleaned_table <- subset(wiki_table, key != "")

# Create a dataframe for full name
name_df <- data.frame(key = "Full name", value = page_title)

# Combine name_df with cleaned_table
wiki_table2 <- rbind(name_df, cleaned_table)

# Print wiki_table
wiki_table2
##                  key
## 1          Full name
## 2          Residence
## 3         Alma mater
## 4          Known for
## 5             Awards
## 6  Scientific career
## 7             Fields
## 8             Thesis
## 9  Doctoral advisors
## 10 Doctoral students
##                                                                                value
## 1                                                                     Hadley Wickham
## 2                                                                      United States
## 3                                      Iowa State University, University of Auckland
## 4                                                    R programming language packages
## 5  John Chambers Award (2006)\nFellow of the American Statistical Association (2015)
## 6                                                                  Scientific career
## 7                                 Statistics\nData science\nR (programming language)
## 8                               Practical tools for exploring data and models (2008)
## 9                                                             Di Cook\nHeike Hofmann
## 10                                                                 Garrett Grolemund
library(httr)
library(rvest)
library(xml2)

get_infobox <- function(title){
  base_url <- "https://en.wikipedia.org/w/api.php"
  
  # Change "Hadley Wickham" to title
  query_params <- list(action = "parse", 
    page = title, 
    format = "xml")
  
  resp <- GET(url = base_url, query = query_params)
  resp_xml <- content(resp)
  
  page_html <- read_html(xml_text(resp_xml))
  infobox_element <- html_node(x = page_html, css =".infobox")
  page_name <- html_node(x = infobox_element, css = ".fn")
  page_title <- html_text(page_name)
  
  wiki_table <- html_table(infobox_element)
  colnames(wiki_table) <- c("key", "value")
  cleaned_table <- subset(wiki_table, key != "")
  name_df <- data.frame(key = "Full name", value = page_title)
  wiki_table <- rbind(name_df, cleaned_table)
  
  wiki_table
}

# Test get_infobox with "Hadley Wickham"
get_infobox(title = "Hadley Wickham")
##                  key
## 1          Full name
## 2          Residence
## 3         Alma mater
## 4          Known for
## 5             Awards
## 6  Scientific career
## 7             Fields
## 8             Thesis
## 9  Doctoral advisors
## 10 Doctoral students
##                                                                                value
## 1                                                                     Hadley Wickham
## 2                                                                      United States
## 3                                      Iowa State University, University of Auckland
## 4                                                    R programming language packages
## 5  John Chambers Award (2006)\nFellow of the American Statistical Association (2015)
## 6                                                                  Scientific career
## 7                                 Statistics\nData science\nR (programming language)
## 8                               Practical tools for exploring data and models (2008)
## 9                                                             Di Cook\nHeike Hofmann
## 10                                                                 Garrett Grolemund
# Try get_infobox with "Ross Ihaka"
get_infobox(title = "Ross Ihaka")
##                                                 key
## 1                                         Full name
## 2  Ihaka at the 2010 New Zealand Open Source Awards
## 3                                         Residence
## 4                                        Alma mater
## 5                                         Known for
## 6                                            Awards
## 7                                 Scientific career
## 8                                            Fields
## 9                                      Institutions
## 10                                           Thesis
## 11                                 Doctoral advisor
##                                                       value
## 1                                                Ross Ihaka
## 2          Ihaka at the 2010 New Zealand Open Source Awards
## 3                                     Auckland, New Zealand
## 4  University of AucklandUniversity of California, Berkeley
## 5                                    R programming language
## 6                                    Pickering Medal (2008)
## 7                                         Scientific career
## 8                                     Statistical Computing
## 9                                    University of Auckland
## 10                                          Ruaumoko (1985)
## 11                                      David R. Brillinger
# Try get_infobox with "Grace Hopper"
get_infobox(title = "Grace Hopper")
##                                   key
## 1                           Full name
## 2  Rear Admiral Grace M. Hopper, 1984
## 3                         Nickname(s)
## 4                                Born
## 5                                Died
## 6                     Place of burial
## 7                          Allegiance
## 8                      Service/branch
## 9                    Years of service
## 10                               Rank
## 11                             Awards
##                                                                                                                                                                                                                                                                                   value
## 1                                                                                                                                                                                                                                                                   Grace Murray Hopper
## 2                                                                                                                                                                                                                                                    Rear Admiral Grace M. Hopper, 1984
## 3                                                                                                                                                                                                                                                                       "Amazing Grace"
## 4                                                                                                                                                                                                                           (1906-12-09)December 9, 1906\nNew York City, New York, U.S.
## 5                                                                                                                                                                                                                        January 1, 1992(1992-01-01) (aged 85)Arlington, Virginia, U.S.
## 6                                                                                                                                                                                                                                                           Arlington National Cemetery
## 7                                                                                                                                                                                                                                                              United States of America
## 8                                                                                                                                                                                                                                                                    United States Navy
## 9                                                                                                                                                                                                                                                       1943–1966, 1967–1971, 1972–1986
## 10                                                                                                                                                                                                                                                            Rear admiral (lower half)
## 11 Defense Distinguished Service Medal Legion of Merit Meritorious Service Medal American Campaign Medal World War II Victory Medal National Defense Service Medal Armed Forces Reserve Medal with two Hourglass Devices Naval Reserve Medal Presidential Medal of Freedom (posthumous)

Data Visualization in R with lattice

Chapter 1 - Basic plotting with lattice

Introduction - general objectives:

  • Visualization may be for EDA or for reporting results
  • Three basic graphing capabilities in R
    • Base - powerful but not flexible
    • lattice - based on “Trellis graphics” (Cleveland)
    • ggplot2 - based on “Grammar of Graphics” (Wilkinson)
  • This course will cover lattice graphics for both EDA and reporting
  • Focus will be on the USCancerRates dataset, with exploration of variance by gender and location
    • histogram(~ x, data=) # lattice call to make a histogram (default y-axis appears to be RELATIVE frequency by bin)
    • xyplot(y ~ x, data=) # lattice call to make an xy plot
    • The modeling calls are similar to what would be seen in an lm()

Optional arguments:

  • Plotting functions in lattice frequently require two arguments - formula and data set
  • Additional options are available and can be supplied to certain functions
    • For example, histogram(~ x, data=, main=, xlab=) # will give the plot title “main” and the X-axis label “xlab”
    • xyplot can also have a ylab=
    • histogram can also have nint= (specifies the number of bins)
    • The grid= argument of xyplot adds a background grid, while abline= adds a line with slope and intercept as specified
  • In the case of histogram(), the optional argument type controls what is plotted on the y-axis. It can take three values:
    • “percent”, the default, gives percentage or relative frequency
    • “count” gives bin count, which is the default in hist()
    • “density” gives a density histogram
  • The lattice function densityplot() creates kernel density plots (formula interface is similar to that of histogram())
    • the formula should be written as ~ x to plot the values of the x column along the x-axis, and the estimated density on the y-axis
    • A useful optional argument for densityplot() is plot.points, which can take values
    • TRUE, the default, to plot the data points along the x-axis in addition to the density
    • FALSE to suppress plotting the data points
    • “jitter”, to plot the points along the x-axis, but with some random jittering in the y-direction so that overlapping points are easier to see

Box and whisker plots and reordering elements:

  • Box and whisker plots are formed using bwplot(~ x, data=)
  • Can serve a similar purpose as a histogram or density plot, and the formula is correspondingly similar
    • bwplot(y ~ x, data=) will make box plots for x, split by each level of y (which needs to be a factor/categorical)
  • The function reorder(myFactor, myData, myFunction, … ) will reorder factor variables for plotting
    • For example, reorder(state, rate.male, median, na.rm=TRUE) will order the factor variable state by median(rate.male) in that state
  • Your task for this exercise is to produce a box-and-whisker plot where the whiskers extend to the data extremes
    • These calculations are controlled by the coef argument of the R helper function boxplot.stats()
    • A positive value of coef makes the whiskers extend to no more than coef times the length of the box
    • The value of coef = 0 makes the whiskers extend to the data extremes

Example code includes:

data(airquality)
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
# Load the lattice package
library(lattice)


# Create the histogram 
histogram(~ Ozone, data = airquality)

# Create the histogram
histogram(~ Ozone, data = airquality, 
          # Specify number of bins
          nint = 15,
          # Specify quantity displayed on y-axis
          type = "count")

# Create the scatter plot
xyplot(Ozone ~ Solar.R, data = airquality)

# Create scatterplot
xyplot(Ozone ~ Temp, data = airquality,
       # Add main label
       main = "Environmental conditions in New York City (1973)", 
       # Add axis labels
       ylab = "Ozone (ppb)",
       xlab = "Temperature (Fahrenheit)")

# Create a density plot
densityplot(~ Ozone, data = airquality, 
    # Choose how raw data is shown
    plot.points = "jitter")

data(USCancerRates, package="latticeExtra")
str(USCancerRates)
## 'data.frame':    3041 obs. of  8 variables:
##  $ rate.male   : num  364 346 341 336 330 ...
##  $ LCL95.male  : num  311 274 304 289 293 ...
##  $ UCL95.male  : num  423 431 381 389 371 ...
##  $ rate.female : num  151 140 182 185 172 ...
##  $ LCL95.female: num  124 103 161 157 151 ...
##  $ UCL95.female: num  184 190 206 218 195 ...
##  $ state       : Factor w/ 49 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ county      :Class 'AsIs'  chr [1:3041] "Pickens County" "Bullock County" "Russell County" "Barbour County" ...
rn_USCR <- row.names(USCancerRates)

# Create reordered variable
library(dplyr)
USCancerRates <-
    mutate(USCancerRates, 
           state.ordered = reorder(state, rate.female, median, na.rm = TRUE)
           )

# Create box and whisker plot
bwplot(state.ordered ~ rate.female, data = USCancerRates)

# Create box and whisker plot
bwplot(state.ordered ~ rate.female, data = USCancerRates, 
       # Change whiskers extent
       coef = 0)


Chapter 2 - Conditioning and the Formula Interface

Conditioning - identify sources of variability in the data by examining sub-groups:

  • Small multiple design - conditioning/faceting approach
  • The conditioning operator in lattice is the single-pipe (|)
    • xyplot(y ~ x | c, data=) # c is the conditioning variable in this example
    • Can use the conditioning operator in any function within the lattice framework
  • The plus (+) operator is another way to condition - means condition on more than one variable
    • histogram(~ a + b, outer=TRUE, layout=c(1, 2), data=) will put a separate histogram for b below the separate histogram for a, keeping both on the same scale
    • The outer argument determines how to interpret a + b
    • The layout=c(1, 2) means 1 column and 2 rows - general format is layout=c(ncol, nrow, npages)
  • Since count data tend to have higher variances at higher means, a log transform can be valuable here
  • lattice, unlike ggplot2, allows you to have data in a wide format

Data summary and transformation - grouping:

  • Data summarization can be especially valuable for reporting
    • For example, may want to summarize cancer rates by state (median county) rather than by county
    • The tapply() function can be valuable for applying a summary function to a vector within groups
    • To get both genders on the same plot but in different colors, use xyplot(State ~ Rate, data=, grid=TRUE, groups=Gender) # treats Gender as a grouping variable, drawn in a different color on the same plot
  • New concept: groups - interpreted as a factor that defines sub-groups
    • xyplot() and densityplot() support this, while histogram() does not
    • Using auto.key = TRUE will add a legend showing which colors correspond to which groups
  • For more detailed control, the auto.key argument can be a list with various sub-components, the most useful of which are
    • space: which can be “left”, “right”, “top”, or “bottom”
    • columns: specifies the number of columns in which to divide up the levels
    • title: specifies a title for the legend
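The per-group summary and grouping ideas above can be sketched with the built-in airquality data (a minimal sketch; airquality stands in for the course's cancer-rate data):

```r
# Sketch of per-group summaries and grouped plotting; the built-in
# airquality data stands in for the course's cancer-rate data
library(lattice)
data(airquality)

# tapply() applies a function to a vector within groups defined by a factor:
# median Ozone for each Month (5 = May, ..., 9 = September)
med_by_month <- tapply(airquality$Ozone, airquality$Month, median, na.rm = TRUE)
med_by_month

# groups= draws each level in its own color; auto.key adds the legend
xyplot(Ozone ~ Temp, data = airquality,
       groups = factor(Month),
       grid = TRUE,
       auto.key = list(space = "right", title = "Month"))
```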

Incorporating external data sources:

  • Can potentially split panels into multiple pages or place multiple plots on the same page
    • For example, could aggregate states by region and report states in the same region together
    • The layout argument inside a lattice plotting function takes the form layout=c(ncol, nrow)
    • The between argument controls spacing between panels - between=list(y=c(0, 0, 1, 0, 0)) will put a space of 1 between the third and fourth rows
  • Setting outer=FALSE makes the conditioning variable into a grouping variable - a more effective visual with multiple plots together on the same panel
  • In a conditioned lattice plot, the panels are by default drawn starting from the bottom-left position, going right and then up
    • This is patterned on the Cartesian coordinate system where the x-axis increases to the right and the y-axis increases from bottom to top
  • Often, want to change this so that the layout is similar to a matrix or table, where rows start at the top
    • The layout of any conditioned lattice plot can be changed to follow this scheme by adding the optional argument as.table = TRUE

The trellis object - lattice creates trellis objects rather than directly creating plots (as in base R):

  • Can run class(), summary(), and the like, with auto-printing and/or print() making the plot visible
  • If you have a trellis object, the update() function can be used to modify it
  • Can think of the trellis object as being like a matrix, so t(trellisObject) will flip the rows/columns
    • In particular, its dimnames() are used as strip labels
  • Depending on the amount of space available, a conditioned plot may have too many combinations to be displayed effectively
    • Such plots can be split into multiple pages using the layout argument
    • But another convenient way to explore large lattice plots is to subset them like a matrix or array, using the [ indexing operator, to display only parts of the plot at a time
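The trellis-object ideas above can be sketched with a small conditioned plot (a minimal sketch using the built-in airquality data):

```r
# Sketch of working with a "trellis" object; airquality is illustrative
library(lattice)
data(airquality)

tplot <- histogram(~ Ozone | factor(Month), data = airquality)

class(tplot)  # "trellis" -- a plot object, not yet drawn
dim(tplot)    # 5 -- one panel per level of factor(Month)

# Nothing is drawn until the object is printed; update() modifies a copy
print(update(tplot, xlab = "Ozone (ppb)"))

# Index like an array to display a subset of the panels
print(tplot[1:2])
```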

Example code includes:

# The airquality dataset has been pre-loaded
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
# Create a histogram
histogram(~ Ozone | factor(Month),
          data = airquality, 
          # Define the layout
          layout=c(2, 3),
          # Change the x-axis label
          xlab="Ozone (ppb)")

# USCancerRates has been pre-loaded
str(USCancerRates)
## 'data.frame':    3041 obs. of  9 variables:
##  $ rate.male    : num  364 346 341 336 330 ...
##  $ LCL95.male   : num  311 274 304 289 293 ...
##  $ UCL95.male   : num  423 431 381 389 371 ...
##  $ rate.female  : num  151 140 182 185 172 ...
##  $ LCL95.female : num  124 103 161 157 151 ...
##  $ UCL95.female : num  184 190 206 218 195 ...
##  $ state        : Factor w/ 49 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ county       :Class 'AsIs'  chr [1:3041] "Pickens County" "Bullock County" "Russell County" "Barbour County" ...
##  $ state.ordered: Factor w/ 49 levels "Utah","New Mexico",..: 25 25 25 25 25 25 25 25 25 25 ...
##   ..- attr(*, "scores")= num [1:49(1d)] 166 166 145 169 159 ...
##   .. ..- attr(*, "dimnames")=List of 1
##   .. .. ..$ : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
# Create a density plot
densityplot(~ rate.male + rate.female,
    data = USCancerRates, 
    outer = TRUE,
    # Suppress data points
    plot.points = FALSE,
    # Add a reference line
    ref=TRUE)

# Create a density plot
densityplot(~ rate.male + rate.female,
    data = USCancerRates,
    # Set value of 'outer' 
    outer=FALSE,
    # Add x-axis label
    xlab="Rate (per 100,000)",
    # Add a legend
    auto.key=TRUE,
    plot.points = FALSE,
    ref = TRUE)

xyplot(Ozone ~ Temp, airquality, groups = Month,
       # Complete the legend spec
       auto.key = list(space = "right", 
                       title = "Month", 
                       text = month.name[5:9]))

USCancerRates <- USCancerRates %>%
    mutate(division=state.division[match(state, state.name)])

# Create 'division.ordered' by reordering levels
USCancerRates <- 
  mutate(USCancerRates, 
         division.ordered = reorder(division, 
                                    rate.male + rate.female, 
                                    mean, na.rm = TRUE))

# Create conditioned scatter plot
xyplot(rate.female ~ rate.male | division.ordered,
       data = USCancerRates, 
       # Add reference grid
       grid = TRUE, 
       # Add reference line
       abline = c(0, 1))

# Levels of division.ordered
levels(USCancerRates$division.ordered)
## [1] "Mountain"           "West North Central" "Pacific"           
## [4] "Middle Atlantic"    "New England"        "East North Central"
## [7] "West South Central" "South Atlantic"     "East South Central"
# Specify the as.table argument 
xyplot(rate.female ~ rate.male | division.ordered,
       data = USCancerRates, 
       grid = TRUE, abline = c(0, 1),
       as.table=TRUE)

# Create box-and-whisker plot
bwplot(division.ordered ~ rate.male + rate.female,
       data = USCancerRates, 
       outer = TRUE, 
       # Add a label for the x-axis
       xlab="Rate (per 100,000)",
       # Add strip labels
       strip = strip.custom(factor.levels = c("Male", "Female")))

# Create "trellis" object
tplot <-
    densityplot(~ rate.male + rate.female | division.ordered, 
                data = USCancerRates, outer = TRUE, 
                plot.points = FALSE, as.table = TRUE)

# Change names for the second dimension
dimnames(tplot)[[2]] <- c("Male", "Female")

# Update x-axis label and plot
update(tplot, xlab = "Rate")

# Create "trellis" object
tplot <-
    densityplot(~ rate.male + rate.female | division.ordered, 
                data = USCancerRates, outer = TRUE, 
                plot.points = FALSE, as.table = TRUE)

# Inspect dimension
dim(tplot)
## [1] 9 2
dimnames(tplot)
## $division.ordered
## [1] "Mountain"           "West North Central" "Pacific"           
## [4] "Middle Atlantic"    "New England"        "East North Central"
## [7] "West South Central" "South Atlantic"     "East South Central"
## 
## [[2]]
## [1] "rate.male"   "rate.female"
# Select subset retaining only last three divisions
tplot[7:9, ]


Chapter 3 - Controlling scales and graphical parameters

Combining scales:

  • Can use dotplot(y ~ x | c + d, data=, as.table=TRUE) to have a conditioned dot-plot on c and d
    • Expects a categorical variable on at least one of the axes (typically, but not always, y)
  • The default for axis limits is for them to be common across all the panels - typically, best for interpretation
    • Can override the default behavior using the scales argument, a list with named components
    • relation = “same” is the default
    • relation = “free” allows independence for each panel
    • relation = “sliced” gives each panel its own limits, but forces every panel's axis to span the same range, so panels differ in location while staying comparable in scale
  • The call to scales is fairly complicated
    • scales = list(x = list(relation = “free”)) # asks for an x-axis scale to be free
  • Some other useful sub-components of the scales argument are:
    • tick.number: approximate number of tick marks / labels
    • alternating: 1 puts labels on the left/bottom boundary, 2 on the right/top, and 3 on both sides. The value can be a vector, in which case it applies row-wise or column-wise
    • rot: angle in degrees to rotate axis labels
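A “sliced” relation can be sketched as follows (a minimal sketch using the built-in airquality data; each panel keeps its own limits but spans the same range):

```r
# Sketch of relation = "sliced": each panel gets its own y limits, but
# every panel's y-axis spans the same range, keeping scales comparable
library(lattice)
data(airquality)

p_sliced <- xyplot(Ozone ~ Temp | factor(Month), data = airquality,
                   scales = list(y = list(relation = "sliced")),
                   as.table = TRUE)
p_sliced
```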

Logarithmic scales:

  • dotplot(y ~ x | c, data=, groups=d, as.table=TRUE) will use d as a grouping variable, with the panels conditioned only on c
  • Can use the log() transform directly on the y variable to help with visualizing the data
  • Alternately, can keep the data the same but just stretch the scales
    • dotplot(y ~ x | c, data=, groups=d, scales=list(x = list(log = 2, equispaced.log=FALSE)), auto.key=list(columns=2))
  • There is one more component you need to know, equispaced.log
    • This component indicates if the tick marks are equispaced when log scales are in use
    • By default, equispaced.log is set to TRUE
    • Note: If you set equispaced.log = FALSE, you don’t have to explicitly specify a base for the log component; just log = TRUE should do the trick!

Graphical parameters:

  • A collection of graphical parameters is referred to as a theme, frequently stored globally so it can be easily re-used
  • Calling trellis.par.set(myTheme) sets myTheme as the theme for subsequent plots
    • The latticeExtra package has ggplot2like() which will help match up the ggplot2 defaults
  • Can also control graphical parameters by way of arguments within a graphics function call
    • For example, pch=15, col=c(“red”, “blue”)
  • Changing the graphical theme using trellis.par.set(), as demonstrated in the preceding video, makes the changes permanent, applying to all subsequent plots, until the theme is reset
    • If you wish to make changes for a specific plot, an easier alternative is to supply the theme as the optional argument par.settings to a high-level call
    • In that case, the settings will be associated only to that particular call
    • In this exercise, you will use this approach to create a dot plot of the WorldPhones data with the ggplot2like() theme
    • As we saw earlier, changing the theme alone may be insufficient; we also need to change other things like the spacing between panels
    • Such settings (which are not considered graphical parameters) can also be customized through a list of “options”
    • To go with the ggplot2like() theme, the latticeExtra package also provides a suitable list of options, produced by ggplot2like.opts()
  • Options can be associated to a particular plot by specifying it as the lattice.options argument in a high-level call, or set more permanently using the lattice.options() function

Using simpleTheme():

  • Empty circles are the default plotting symbol
    • The pch=16 will create filled-in circles
  • Interestingly, changing parameters like pch in the function call applies only to the data, not to the legend describing the data
    • Can instead make changes that apply to everything by specifying (inside the function) par.settings = simpleTheme(pch=16, col=c(“red”, “blue”))
    • The simpleTheme() call will only change the requested options, leaving the global theme for everything else
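The simpleTheme() approach can be sketched as follows (a minimal sketch; the grouping, colors, and legend text are illustrative):

```r
# Sketch of a per-plot theme via simpleTheme(); the grouping and the
# colors are illustrative only
library(lattice)
data(airquality)

p_theme <- xyplot(Ozone ~ Temp, data = airquality,
                  groups = Month > 7,
                  # applies to both the points and the auto.key legend,
                  # and only to this plot; the global theme is untouched
                  par.settings = simpleTheme(pch = 16, col = c("red", "blue")),
                  auto.key = list(columns = 2,
                                  text = c("May-Jul", "Aug-Sep")))
p_theme
```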

Example code includes:

# The lattice package and the USMortality dataset have been pre-loaded.
Status <- factor(c('Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban', 'Rural'), levels=c("Rural", "Urban")
                 )
Sex <- factor(c('Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female'), levels=c("Female", "Male")
              )
Cause <- factor(c('Heart disease', 'Heart disease', 'Heart disease', 'Heart disease', 'Cancer', 'Cancer', 'Cancer', 'Cancer', 'Lower respiratory', 'Lower respiratory', 'Lower respiratory', 'Lower respiratory', 'Unintentional injuries', 'Unintentional injuries', 'Unintentional injuries', 'Unintentional injuries', 'Cerebrovascular diseases', 'Cerebrovascular diseases', 'Cerebrovascular diseases', 'Cerebrovascular diseases', 'Alzheimers', 'Alzheimers', 'Alzheimers', 'Alzheimers', 'Diabetes', 'Diabetes', 'Diabetes', 'Diabetes', 'Flu and pneumonia', 'Flu and pneumonia', 'Flu and pneumonia', 'Flu and pneumonia', 'Suicide', 'Suicide', 'Suicide', 'Suicide', 'Nephritis', 'Nephritis', 'Nephritis', 'Nephritis'), 
                levels=c('Alzheimers', 'Cancer', 'Cerebrovascular diseases', 'Diabetes', 'Flu and pneumonia', 'Heart disease', 'Lower respiratory', 'Nephritis', 'Suicide', 'Unintentional injuries')
                )
Rate <- c(210.2, 242.7, 132.5, 154.9, 195.9, 219.3, 140.2, 150.8, 44.5, 62.8, 36.5, 46.9, 49.6, 71.3, 24.7, 37.2, 36.1, 42.2, 34.9, 42.2, 19.4, 21.8, 25.5, 30.6, 24.9, 29.5, 17.1, 21.8, 17.7, 20.8, 12.9, 16.3, 19.2, 26.3, 5.3, 6.2, 15.7, 18.3, 10.7, 13.9)
SE <- c(0.2, 0.6, 0.2, 0.4, 0.2, 0.5, 0.2, 0.4, 0.1, 0.3, 0.1, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.1, 0.1, 0.2, 0, 0.1, 0.1, 0.2, 0, 0.1)

USMortality <- data.frame(Status=Status, Sex=Sex, Cause=Cause, Rate=Rate, SE=SE)


# Specify upper bound to exclude Heart disease and Cancer
x_limits <- c(0, 100)

# Draw the plot
dotplot(Cause ~ Rate | Sex + Status, data = USMortality, as.table = TRUE, 
        xlim = x_limits)

dotplot(Cause ~ Rate | Sex + Status, data = USMortality,
        as.table = TRUE,
        scales = list(x = list(relation = "free",
                               # Specify limits for each panel
                               limits = list(c(0, 50), c(0, 80), 
                                             c(0, 50), c(0, 80) ))))

dotplot(Cause ~ Rate | Sex + Status, data = USMortality, 
        as.table = TRUE,
        # Change the number of tick marks
        scales = list(x = list(tick.number = 10, 
                               # Show `Rate` labels on both bottom and top
                               alternating = 3, 
                               # Rotate `Rate` labels by 90 degrees
                               rot = 90),
                      # Rotate `Cause` labels by 45 degrees
                      y = list(rot = 45)))

# Define at as 2^3 up to 2^8
x_ticks_at <- 2^(3:8)

dotplot(Cause ~ Rate | Sex, data = USMortality,
        groups = Status, auto.key = list(columns = 2),
        scales = list(x = list(log = 2, 
                               # A numeric vector with 
                               # values 2^3, 2^4, ..., 2^8
                               at = x_ticks_at, 
                               # A character vector, 
                               # "8" for 2^3, "16" for 2^4, etc.
                               labels = x_ticks_at)))

# Create the dot plot
dotplot(Cause ~ Rate | Status, data = USMortality,
        groups = Sex, auto.key = list(columns = 2),
        scales = list(x = list(log = TRUE, 
                      equispaced.log = FALSE)), 
        # Provide pch values for the two groups
        pch = c(3, 1))

dotplot(Cause ~ Rate | Status, data = USMortality,
        groups = Sex, auto.key = list(columns = 2),
        par.settings = simpleTheme(pch = c(3, 1)),
        scales = list(x = list(log = 2, equispaced.log = FALSE)))

# The WorldPhones matrix contains 1951 plus the consecutive years 1956-1961;
# drop 1951 so that only consecutive years remain
data(WorldPhones)
WorldPhones <- WorldPhones[rownames(WorldPhones) != "1951", ]
WorldPhones
##      N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
## 1956  60423  29990 4708   2568    2366   1411      733
## 1957  64721  32510 5230   2695    2526   1546      773
## 1958  68484  35218 6662   2845    2691   1663      836
## 1959  71799  37598 6856   3000    2868   1769      911
## 1960  76036  40341 8220   3145    3054   1905     1008
## 1961  79831  43173 9053   3338    3224   2005     1076
names(dimnames(WorldPhones)) <- c("Year", "Region")

# Transform matrix data to data frame
WorldPhonesDF <- as.data.frame(
                   # Intermediate step: convert to table
                   as.table(WorldPhones), 
                   responseName = "Phones")

# Create the dot plot
dotplot(Year ~ Phones | Region, 
        data = WorldPhonesDF, 
        as.table = TRUE,
        # Log-transform the x-axis
        scales = list(x = list(log = TRUE,
                               equispaced.log = FALSE, 
                               # Set x-axis relation to "sliced"
                               relation = "sliced")))

# Load latticeExtra package for ggplot2like()
library(latticeExtra)
## Loading required package: RColorBrewer
## 
## Attaching package: 'latticeExtra'
## The following object is masked _by_ '.GlobalEnv':
## 
##     USCancerRates
## The following object is masked from 'package:ggplot2':
## 
##     layer
# Transform matrix data to data frame
names(dimnames(WorldPhones)) <- c("Year", "Region")
WorldPhonesDF <- 
  as.data.frame(as.table(WorldPhones), 
                responseName = "Phones")

# Create the dot plot
dotplot(Year ~ Phones | Region,
        data = WorldPhonesDF, 
        as.table = TRUE,
        scales = list(x = list(log = TRUE,
                               equispaced.log = FALSE, 
                               relation = "sliced")),
        # Fill in suitable value of par.settings
        par.settings = ggplot2like(),
        # Fill in suitable value of lattice.options
        lattice.options = ggplot2like.opts())

# Create factor variable
airquality$Month.Name <- 
  factor(airquality$Month, levels = 1:12, 
         labels = month.name[1:12])
         
# Create histogram of Ozone, conditioning on Month
histogram(~ Ozone | Month.Name,
          data = airquality, as.table = TRUE,
          # Set border to be transparent
          border = "transparent", 
          # Set fill color to be mid-gray
          col = "grey50")

# Create factor variable
airquality$Month.Name <- 
  factor(airquality$Month, levels = 1:12, 
         labels = month.name)
levels(airquality$Month.Name)
##  [1] "January"   "February"  "March"     "April"     "May"      
##  [6] "June"      "July"      "August"    "September" "October"  
## [11] "November"  "December"
# Drop empty levels
airquality$Month.Name <- droplevels(airquality$Month.Name)
levels(airquality$Month.Name)
## [1] "May"       "June"      "July"      "August"    "September"
# Obtain colors from RColorBrewer
library(RColorBrewer)
my.colors <- brewer.pal(n = 5, name = "Set1")

# Density plot of ozone concentration grouped by month
densityplot(~ Ozone, data = airquality, groups = Month.Name,
            plot.points = FALSE,
            auto.key = list(space = "right"),
            # Fill in value of col
            par.settings = simpleTheme(col = my.colors, 
                                       # Fill in value of lwd
                                       lwd = 2))


Chapter 4 - Customizing plots using panel functions

Panel functions:

  • Declarative approach (you provide specifications, system figures out requirements) is used by ggplot2
  • Procedural approach (you provide step-by-step) is used by lattice and base R
  • Custom displays in lattice cannot be created by the built-in options alone; instead, a custom panel function must be built up
    • panel.histdens <- function(x, ...) { panel.histogram(x, ...) ; panel.lines(density(x, na.rm=TRUE)) } # overlay a density curve on a density histogram
    • The panel.histdens is then used inside the function, such as panel=panel.histdens inside a call to histogram()
    • The base R functions like lines and points need to be replaced by their lattice equivalents like panel.lines and panel.points for the lattice code to work
  • Rather than customizing a default display, sometimes you may want to replace it entirely
    • An example of this is a violin plot, which is structured like a box and whisker plot, but instead of the boxes and whiskers, it uses kernel density estimates to summarize a distribution
    • The resulting plot retains the compactness of a box and whisker plot, but also shows features like bimodality
    • The built-in function panel.violin() in the lattice package implements the display of violin plots
  • When there are a large number of points in the data, there may be substantial overplotting in a standard scatter plot
    • Another built-in panel function available in the lattice package that can serve as a replacement for panel.xyplot() in such cases is panel.smoothScatter()
    • Instead of plotting the points directly, it uses a color gradient to show a 2-D kernel density estimate obtained from the data
    • xyplot(rate.female ~ rate.male, data = USCancerRates, panel = panel.smoothScatter, scales = list(log = TRUE, equispaced.log = FALSE), main = "County-wise deaths due to cancer")

Prepanel Functions to control limits:

  • Controlling the x/y axis limits is enabled within lattice
    • prepanel.histdens.1 <- function(x, ...) { d <- density(x, na.rm = TRUE); list(ylim = c(0, max(d$y))) }
    • histogram(~ rate.male + rate.female, USCancerRates, type = "density", scales = list(x = list(log = 10)), xlab = "Rate", panel = panel.histdens, prepanel = prepanel.histdens.1)

Optional arguments of default panel functions:

  • Some optional arguments are common to all high-level functions in lattice - xlab, ylab, main, layout, between, scales
  • Some optional arguments are specific to a single high-level function
    • nint, type - histogram()
    • plot.points, ref - densityplot()
    • grid, abline - xyplot()
    • col, lwd, cex, pch - common graphical parameters passed through to panel functions
  • The high-level functions will handle the general arguments, while sweeping up all the others for passage to the panel functions
    • For example, passing grid=TRUE in xyplot() passes the argument to panel.xyplot()
  • The type argument for xyplot adds a number of arguments
    • "p" - points
    • "l" - lines
    • "r" - regression line by way of panel.lmline()
    • "smooth" - LOESS smooth by way of panel.loess()
    • "a" - join average y values for each unique x value by way of panel.average()
    • Multiple types can be specified as a vector
  • Note a few following features for an xyplot()
    • grid = list(h=-1, v=0) # draws horizontal reference lines
    • type = c("p", "a") draws the points and a line connecting their averages
    • jitter.x = TRUE will apply a jitter on the x-axis only
  • The default panel function for bwplot() has two additional arguments that you have not used before:
    • pch = “|” replaces the black dot representing the median inside the box by a line segment dividing the box into two smaller rectangles
    • notch = TRUE puts "notches" on the sides of the boxes that indicate a confidence interval for the median
    • the overlapping of notches for two subgroups suggests that the true medians of the two subgroups are not significantly different
  • For the last exercise in this chapter, your task is to recreate a grouped dot plot you have seen before, but replace the plotting characters by emoji images
    • To do so, you will use the panel.xyimage() function in the latticeExtra package, which is similar to panel.xyplot(), except that plotting symbols are replaced by images whose locations (file names or URLs of JPEG or PNG image files) are specified as the pch argument
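
The type, grid, and jitter.x options above can be sketched together in one call; the iris dataset and variable choices here are illustrative assumptions, not from the course:

```r
library(lattice)

# Points plus a LOESS smooth ("smooth"); horizontal grid lines only
# (h = -1 aligns them with the y-axis ticks, v = 0 draws none);
# jittered x values to reduce overplotting
p <- xyplot(Sepal.Length ~ Sepal.Width, data = iris,
            type = c("p", "smooth"),
            grid = list(h = -1, v = 0),
            jitter.x = TRUE)
p  # printing the trellis object draws the plot
```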

Example code includes:

panel.xyrug <- function(x, y, ...)
{
  # Reproduce standard scatter plot
  panel.xyplot(x, y, ...)
  
  # Identify observations with x-value missing
  x.missing <- is.na(x)
  
  # Identify observations with y-value missing
  y.missing <- is.na(y)
  
  # Draw rugs along axes
  panel.rug(x = x[y.missing], y = y[x.missing])
}

airquality$Month.Name <- 
    factor(month.name[airquality$Month], levels = month.name)
    
xyplot(Ozone ~ Solar.R | Month.Name, data = airquality,
       panel = panel.xyrug, as.table = TRUE)

# Create factor variable with month names
airquality$Month.Name <- 
  factor(month.name[airquality$Month], levels = month.name)

# Create box-and-whisker plot
bwplot(Month.Name ~ Ozone + Temp, airquality, 
       # Specify outer
       outer=TRUE, 
       # Specify x-axis relation
       scales = list(x = list(relation="free")),
       # Specify layout
       layout=c(2, 1),
       # Specify x-axis label
       xlab="Measured value")

# Create violin plot
bwplot(Month.Name ~ Ozone + Temp, airquality, 
       # Specify outer
       outer = TRUE, 
       # Specify x-axis relation
       scales = list(x = list(relation="free")),
       # Specify layout
       layout=c(2, 1),
       # Specify x-axis label
       xlab="Measured value",
       # Replace default panel function
       panel = panel.violin)

# Create panel function
panel.ss <- function(x, y, ...) {
  # Call panel.smoothScatter()
  panel.smoothScatter(x, y, ...)
  # Call panel.loess()
  panel.loess(x, y, col = "red")
  # Call panel.abline()
  panel.abline(0, 1)
}

# Create plot
xyplot(rate.female ~ rate.male, data = USCancerRates,
       panel = panel.ss,
       main = "County-wise deaths due to cancer")
## (loaded the KernSmooth namespace)

# Define prepanel function
prepanel.histdens.2 <- function(x, ...) {
    h <- prepanel.default.histogram(x, ...)
    d <- density(x, na.rm = TRUE)
    list(xlim = quantile(x, c(0.005, 0.995), na.rm = TRUE),
         # Calculate upper y-limit
         ylim = c(0, max(d$y, h$ylim[2])))
}

panel.histdens <- function(x, ...) {
    panel.histogram(x, ...)
    panel.lines(density(x, na.rm = TRUE))
}

# Create a histogram of rate.male and rate.female
histogram(~ rate.male + rate.female,
          data = USCancerRates, outer = TRUE,
          type = "density", nint = 50,
          border = "transparent", col = "lightblue",
          # The panel function: panel.histdens
          panel = panel.histdens, 
          # The prepanel function: prepanel.histdens.2
          prepanel = prepanel.histdens.2,
          # Ensure that the x-axis is log-transformed
          # and has relation "sliced"
          scales = list(x = list(log = TRUE,
                                 equispaced.log = FALSE,
                                 relation = "sliced")),
          xlab = "Rate (per 100,000)")

# Create the box and whisker plot
bwplot(division.ordered ~ rate.male, 
       data = USCancerRates,
       # Indicate median by line instead of dot
       pch = "|", 
       # Include notches for confidence interval
       notch = TRUE,
       # The x-axis should plot log-transformed values
       scales = list(x = list(log=TRUE, equispaced.log=FALSE)),
       xlab = "Death Rate in Males (per 100,000)")

# Load the 'latticeExtra' package
library(latticeExtra)

# Create summary dataset
USCancerRates.state <- 
   with(USCancerRates, {
     rmale <- tapply(rate.male, state, median, na.rm = TRUE)
     rfemale <- tapply(rate.female, state, median, na.rm = TRUE)
     data.frame(Rate = c(rmale, rfemale),
                State = rep(names(rmale), 2),
                Gender = rep(c("Male", "Female"), 
                             each = length(rmale)))
  })

# Reorder levels
library(dplyr)
USCancerRates.state <- 
   mutate(USCancerRates.state, State = reorder(State, Rate))
head(USCancerRates.state)
##     Rate      State Gender
## 1 286.00    Alabama   Male
## 2 237.95     Alaska   Male
## 3 209.30    Arizona   Male
## 4 284.10   Arkansas   Male
## 5 221.30 California   Male
## 6 204.40   Colorado   Male
# URLs for emojis
emoji.man <- "https://twemoji.maxcdn.com/72x72/1f468.png"
emoji.woman <- "https://twemoji.maxcdn.com/72x72/1f469.png"

# Create dotplot
# dotplot(State ~ Rate, data = USCancerRates.state, 
        # Specify grouping variable
#         groups = Gender, 
        # Specify panel function
#         panel = panel.xyimage, 
        # Specify emoji URLs
#         pch = c(emoji.woman, emoji.man),
        # Make symbols smaller
#         cex = 0.75)

Chapter 5 - Extensions and the lattice ecosystem

New methods - lattice is used by many packages because it is highly extensible:

  • High-level lattice functions are "generic functions", and the first argument need not be a formula
  • For example, dotplot() can be applied directly to a table
    • For example, dotplot(WorldPhones[-1, ], scales=list(x=list(log=2)), groups=FALSE, layout=c(1, NA), strip=FALSE, strip.left=TRUE)
  • The xyplot() function has a suitable method for time series objects
    • The function to create the time-series plot is simply xyplot()
    • Instead of a formula and a data frame, the only mandatory argument is a time series object, which must be the first argument
    • The default value of type is "l", so that data points are joined by lines
    • The argument superpose, which can take values TRUE or FALSE, is used to control whether multiple time series are plotted within the same panel or in separate panels, respectively
    • The default is to plot them separately
    • The argument cut, which should be a list of the form list(number = , overlap = ), is used to produce so-called "cut-and-stack" plots, by splitting the time axis into multiple overlapping periods which are then used to condition
    • This makes it easier to see parts of a long series
  • One innovative display design for time series data, known as horizon graphs, is implemented in the panel.horizonplot() function in the latticeExtra package
    • Horizon plots allow you to visualize many time series in a small amount of space
    • The main motivation for this design is to reduce the vertical space occupied by a single time series, without the loss of resolution that would result from simply flattening the usual line graph display
    • This is achieved in two ways. First, negative values are mirrored to lie above the x-axis, but distinguished from positive values by shading using different colors
    • Second, values are divided into bands with progressively higher saturation, and the bands are collapsed to wrap them around lower bands
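
As a minimal sketch of method dispatch on a non-formula first argument (the built-in VADeaths matrix is an assumption, standing in for the WorldPhones data used in the course):

```r
library(lattice)

# dotplot() is generic: given a matrix rather than a formula, it
# dispatches to the matrix method; one panel per column, stacked
p <- dotplot(VADeaths, groups = FALSE, layout = c(1, 4))
p
```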

New high-level functions can be created:

  • Completely new high-level functions are built when the panel options are insufficient
    • The horizonplot() mentioned above is one example
    • The choropleth (colored map) is another - see mapplot() in the latticeExtra package
  • Since the earth is three dimensional but the plot is two dimensional, a projection is required to reduce the number of dimensions
    • The list of available projections is given in the Details section of the mapproject() help page
  • Map plots are drawn in two stages. First, a map object is created using the map() function from the maps package with plot = FALSE
    • the_map <- map("a_map_dataset", plot = FALSE, projection = "some_projection")
  • Second, mapplot() is called with a formula, a data frame, and a map
    • mapplot(region ~ value, data, map = the_map)
  • It is common to have statistical estimates in the form of confidence intervals in addition to point estimates
    • Such data can be displayed using segment plots via the segplot() function in the latticeExtra package
    • segplot(categories ~ lower_limit + upper_limit, data = some_data, centers = point_estimates)
    • Notice that the categories are displayed on the y-axis, and the confidence intervals are displayed on the x-axis
    • The point estimates, usually a mean or median value for that category, are specified using the centers argument, not the formula
    • An optional argument, draw.bands, lets you choose between confidence bands and confidence intervals
    • This argument is passed to the default panel function panel.segplot()
  • One common approach is to plot some form of bivariate density estimate instead of the raw data, as is done with histograms and kernel density plots for univariate data
    • Hexagonal binning and plotting is implemented in the R package hexbin, which also includes the high-level function hexbinplot() for creating conditional hexbin plots using the lattice framework
    • The formula and data argument in a hexbinplot() call is interpreted in the same way as xyplot()
    • The type argument can be set to "r" to add a regression line
    • The trans argument can be a function that is applied to the observed counts before creating bands for different colors
    • By default, the range of counts is divided up evenly into bands, but taking the square root of the counts, for example, emphasizes differences in the lower range of counts more
    • The inv argument gives the inverse function of trans, so that transformed counts can be converted back before being shown in the legend

Manipulation (extension) of trellis objects:

  • latticeExtra::useOuterStrip(latticeObject) will make the strips show only on the top and the left
  • The directlabels package tackles an interesting problem: instead of having a separate legend associating graphical parameters and levels of a grouping variable, it tries to indicate the grouping by placing text labels within the panel
    • This is generally a tricky thing to do automatically. directlabels relies on heuristics, and also allows the user to provide their own heuristics. It works with both lattice and ggplot2 plots
  • Once a lattice plot object is created, it can be modified using the update() method
    • Among other things, a new panel function can be provided as the panel argument, to change or enhance the panel display
    • Specifying the display in the form of a function can be cumbersome, especially for minor changes
    • An alternative approach, implemented in the latticeExtra package, is to add so-called layers to the existing display. This is modeled on the approach used by the ggplot2 package
  • There are two kinds of layers
    • Layers that go below the default display (i.e., are drawn before it) are created by the layer_() function
    • Those that go above are created using layer()
    • There are also corresponding versions glayer_() and glayer() for grouped displays
    • A layer is created by putting a function call, as it would appear inside a panel function, inside a call to layer_() or layer()
  • Suppose you want to create a layer with a call to panel.grid that goes under the display, and a call to panel.lmline() that goes above, and then add it to an existing lattice plot p
    • under_layer <- layer_(panel.grid())
    • over_layer <- layer(panel.lmline(x, y))
    • p + under_layer + over_layer
  • Layers are added to a plot using the + operator
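
A minimal sketch of the update() method mentioned above; the dataset, title, and graphical parameters are illustrative assumptions:

```r
library(lattice)

# Build a plot object, then modify it without re-specifying the data
p <- xyplot(Sepal.Length ~ Sepal.Width, data = iris)
p2 <- update(p, main = "Iris sepals", pch = 16)
p2  # the updated object draws with the new title and plotting character
```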

Example code includes:

# Use 'EuStockMarkets' time series data
data(EuStockMarkets)
str(EuStockMarkets)
##  Time-Series [1:1860, 1:4] from 1991 to 1999: 1629 1614 1607 1621 1618 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:4] "DAX" "SMI" "CAC" "FTSE"
# Create time series plot
xyplot(EuStockMarkets, 
       # Plot all series together
       superpose = TRUE,
       # Split up the time axis into parts
       cut = list(number = 3, overlap = 0.25))

# Create time series plot
xyplot(EuStockMarkets,
       # Specify panel function
       panel=panel.horizonplot,
       # Specify prepanel function
       prepanel=prepanel.horizonplot)

# Load required packages
library(maps)


# Create map object for US counties
county.map <- map("county", plot = FALSE, fill = TRUE, 
                  # Specify projection
                  projection = "sinusoidal")

# Create choropleth map
row.names(USCancerRates) <- rn_USCR

mapplot(row.names(USCancerRates) ~ log10(rate.male) + log10(rate.female), 
        data = USCancerRates, 
        xlab = "", scales = list(draw = FALSE),
        # Specify map
        map = county.map)
## Warning in (function (x, y, map, breaks, colramp, exact = FALSE,
## lwd = 0.5, : 64 unmatched regions: alaska,nome, alaska,wade hampton,
## alaska,haines, alaska,....
## Warning in (function (x, y, map, breaks, colramp, exact = FALSE,
## lwd = 0.5, : 64 unmatched regions: alaska,nome, alaska,wade hampton,
## alaska,haines, alaska,....

# Create subset for Louisiana
LACancerRates1 <- filter(USCancerRates, state == "Louisiana")
str(LACancerRates1)
## 'data.frame':    64 obs. of  11 variables:
##  $ rate.male       : num  369 361 349 338 338 ...
##  $ LCL95.male      : num  316 289 302 308 303 ...
##  $ UCL95.male      : num  428 446 402 372 376 ...
##  $ rate.female     : num  162 193 215 194 192 ...
##  $ LCL95.female    : num  134 150 184 176 170 ...
##  $ UCL95.female    : num  196 246 250 215 218 ...
##  $ state           : Factor w/ 49 levels "Alabama","Alaska",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ county          :Class 'AsIs'  chr [1:64] "Richland Parish" "Madison Parish" "De Soto Parish" "St. Bernard Parish" ...
##  $ state.ordered   : Factor w/ 49 levels "Utah","New Mexico",..: 46 46 46 46 46 46 46 46 46 46 ...
##   ..- attr(*, "scores")= num [1:49(1d)] 166 166 145 169 159 ...
##   .. ..- attr(*, "dimnames")=List of 1
##   .. .. ..$ : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ division        : Factor w/ 9 levels "New England",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ division.ordered: Factor w/ 9 levels "Mountain","West North Central",..: 7 7 7 7 7 7 7 7 7 7 ...
##   ..- attr(*, "scores")= num [1:9(1d)] 417 416 446 466 433 ...
##   .. ..- attr(*, "dimnames")=List of 1
##   .. .. ..$ : chr  "New England" "Middle Atlantic" "South Atlantic" "East South Central" ...
# Reorder levels of county
LACancerRates2 <- 
    mutate(LACancerRates1, 
           county = reorder(county, rate.male))

# Draw confidence intervals
segplot(county ~ LCL95.male + UCL95.male,
        data = LACancerRates2,
        # Add point estimates
        centers = rate.male,
        # Draw segments rather than bands
        draw.bands = FALSE)

# The 'USCancerRates' dataset
str(USCancerRates)
## 'data.frame':    3041 obs. of  11 variables:
##  $ rate.male       : num  364 346 341 336 330 ...
##  $ LCL95.male      : num  311 274 304 289 293 ...
##  $ UCL95.male      : num  423 431 381 389 371 ...
##  $ rate.female     : num  151 140 182 185 172 ...
##  $ LCL95.female    : num  124 103 161 157 151 ...
##  $ UCL95.female    : num  184 190 206 218 195 ...
##  $ state           : Factor w/ 49 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ county          :Class 'AsIs'  chr [1:3041] "Pickens County" "Bullock County" "Russell County" "Barbour County" ...
##  $ state.ordered   : Factor w/ 49 levels "Utah","New Mexico",..: 25 25 25 25 25 25 25 25 25 25 ...
##   ..- attr(*, "scores")= num [1:49(1d)] 166 166 145 169 159 ...
##   .. ..- attr(*, "dimnames")=List of 1
##   .. .. ..$ : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ division        : Factor w/ 9 levels "New England",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ division.ordered: Factor w/ 9 levels "Mountain","West North Central",..: 9 9 9 9 9 9 9 9 9 9 ...
##   ..- attr(*, "scores")= num [1:9(1d)] 417 416 446 466 433 ...
##   .. ..- attr(*, "dimnames")=List of 1
##   .. .. ..$ : chr  "New England" "Middle Atlantic" "South Atlantic" "East South Central" ...
# Load the 'hexbin' package 
library(hexbin)

# Create hexbin plot
hexbinplot(rate.female ~ rate.male, 
           data = USCancerRates, 
           # Add a regression line
           type = "r",
           # function to transform counts
           trans = sqrt,
           # function to invert transformed counts
           inv = function(x) x^2
           )

# Load the 'directlabels' package
library(directlabels)

# Use the 'airquality' dataset
str(airquality)
## 'data.frame':    153 obs. of  7 variables:
##  $ Ozone     : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R   : int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind      : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp      : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month     : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Month.Name: Factor w/ 12 levels "January","February",..: 5 5 5 5 5 5 5 5 5 5 ...
# Create factor variable
airquality$Month.Name <- 
    factor(month.name[airquality$Month], levels = month.name)

# Create density plot object
tplot2 <- 
    densityplot(~ Ozone + Temp, data = airquality, 
                # Variables should go in different panels
                outer = TRUE,
                # Specify grouping variable
                groups = Month.Name,
                # Suppress display of data points
                plot.points = FALSE, 
                # Add reference line
                ref = TRUE,
                # Specify layout
                layout = c(2, 1),
                # Omit strip labels
                strip = FALSE,
                # Provide column-specific x-axis labels
                xlab = c("Ozone (ppb)", "Temperature (F)"),
                # Let panels have independent scales 
                scales = list(relation="free"))

# Produce plot with direct labels
direct.label(tplot2)

# 'USCancerRates' is pre-loaded
str(USCancerRates)
## 'data.frame':    3041 obs. of  11 variables:
##  $ rate.male       : num  364 346 341 336 330 ...
##  $ LCL95.male      : num  311 274 304 289 293 ...
##  $ UCL95.male      : num  423 431 381 389 371 ...
##  $ rate.female     : num  151 140 182 185 172 ...
##  $ LCL95.female    : num  124 103 161 157 151 ...
##  $ UCL95.female    : num  184 190 206 218 195 ...
##  $ state           : Factor w/ 49 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ county          :Class 'AsIs'  chr [1:3041] "Pickens County" "Bullock County" "Russell County" "Barbour County" ...
##  $ state.ordered   : Factor w/ 49 levels "Utah","New Mexico",..: 25 25 25 25 25 25 25 25 25 25 ...
##   ..- attr(*, "scores")= num [1:49(1d)] 166 166 145 169 159 ...
##   .. ..- attr(*, "dimnames")=List of 1
##   .. .. ..$ : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ division        : Factor w/ 9 levels "New England",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ division.ordered: Factor w/ 9 levels "Mountain","West North Central",..: 9 9 9 9 9 9 9 9 9 9 ...
##   ..- attr(*, "scores")= num [1:9(1d)] 417 416 446 466 433 ...
##   .. ..- attr(*, "dimnames")=List of 1
##   .. .. ..$ : chr  "New England" "Middle Atlantic" "South Atlantic" "East South Central" ...
# Create scatter plot
p <- xyplot(rate.female ~ rate.male, data = USCancerRates, 
            # Change plotting character
            pch = 16, 
            # Make points semi-transparent
            alpha = 0.25)

# Create layer with reference grid
l0 <- layer_(panel.grid())

# Create layer with reference line
l1 <- layer(panel.abline(0, 1))

# Create layer with regression fit
l2 <- layer(panel.smoother(x, y, method="lm"))

# Combine and plot
p + l0 + l1 + l2


Visualizing Time Series Data in R

Chapter 1 - R Time Series Visualization Tools

Refresher on xts and the plot() function:

  • In a time series plot, each element is associated with a specific time
  • The xts object is typically the storage mechanism for time series data in R
    • Time Index (Date, POSIXct, or the like) + Matrix
  • The plot() call can be used on xts objects and will dispatch to plot.xts() to achieve this purpose
    • Many of the arguments are similar to a normal plot() - for example, additional series can be overlaid using lines()

Other useful visualizing functions:

  • Can use lines() to add a line to an existing time series plot
  • Can add an axis using axis(side=, at=), where side is 1 for bottom, 2 for left, 3 for top, 4 for right; at=pretty(existingPlotData) gives evenly spaced tick locations
  • Can add legends using legend(x=, legend=, col=, lty=)
  • Can add lines to a plot using abline(v=, h=)
  • The PerformanceAnalytics package allows for better highlighting of portions of the plot
  • To highlight a specific period in a time series, you can display it in the plot in a different background color
    • The chart.TimeSeries() function in the PerformanceAnalytics package offers a very easy and flexible way of doing this
    • chart.TimeSeries(R, period.areas, period.color)
    • R is an xts, time series, or zoo object of asset returns
    • period.areas are shaded areas specified by a start and end date in a vector of xts date ranges like c("1926-10/1927-11")
    • period.color draws the shaded region in whichever color is specified
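
The chart.TimeSeries() arguments above can be sketched as follows; the synthetic price series and the shaded date range are assumptions for illustration, not course data:

```r
library(xts)
library(PerformanceAnalytics)

# Synthetic monthly series standing in for asset prices/returns
set.seed(42)
prices <- xts(cumsum(rnorm(36)),
              order.by = seq(as.Date("2015-01-01"), by = "month",
                             length.out = 36))

# Shade March through June 2016 in light blue
chart.TimeSeries(prices,
                 period.areas = c("2016-03/2016-06"),
                 period.color = "lightblue")
```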

Example code includes:

library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
# data is a 504x4 xts object of Yahoo, Microsoft, Citigroup, and Dow
tmpData <- readr::read_delim("./RInputFiles/dataset_1_1.csv", delim=" ")
## Parsed with column specification:
## cols(
##   Index = col_date(format = ""),
##   yahoo = col_double(),
##   microsoft = col_double(),
##   citigroup = col_double(),
##   dow_chemical = col_double()
## )
data <- xts::xts(tmpData[, -1], order.by=as.POSIXct(tmpData$Index))


# Display the first few lines of the data
head(data)
##                     yahoo microsoft citigroup dow_chemical
## 2015-01-01 18:00:00 50.17  44.30501  53.45259     42.48209
## 2015-01-04 18:00:00 49.13  43.89759  51.76803     41.16821
## 2015-01-05 18:00:00 49.21  43.25329  49.94556     40.50662
## 2015-01-06 18:00:00 48.59  43.80284  50.40857     40.44139
## 2015-01-07 18:00:00 50.23  45.09144  51.16711     41.44776
## 2015-01-08 18:00:00 49.72  44.71244  50.02437     41.38253
# Display the column names of the data
colnames(data)
## [1] "yahoo"        "microsoft"    "citigroup"    "dow_chemical"
# Plot yahoo data and add title
plot(data[, "yahoo"], main="yahoo")

# Replot yahoo data with labels for X and Y axes
plot(data[, "yahoo"], main="yahoo", xlab="date", ylab="price")

# Note that type="h" is for bars
# Plot the second time series and change title
plot(data[, 2], main="microsoft")

# Replot with same title, add subtitle, use bars
plot(data[, 2], main="microsoft", sub="Daily closing price since 2015", type="h")

# Change line color to red
lines(data[, 2], col="red")

# Plot two charts on same graphical window
par(mfrow = c(2, 1))
plot(data[, 1], main="yahoo")
plot(data[, 2], main="microsoft")

# Replot with reduced margin and character sizes
par(mfrow = c(2, 1), mex=0.6, cex=0.8)
plot(data[, 1], main="yahoo")
plot(data[, 2], main="microsoft")

par(mfrow = c(1, 1), mex=1, cex=1)


# Plot the "microsoft" series
plot(data[, "microsoft"], main="Stock prices since 2015")

# Add the "dow_chemical" series in red
lines(data[, "dow_chemical"], col="red")

# Add a Y axis on the right side of the chart
axis(side=4, at=pretty(data[, "dow_chemical"]))

# Add a legend in the bottom right corner
legend("bottomright", legend=c("microsoft", "dow_chemical"), col=c("black", "red"), lty=c(1, 1))

# Plot the "citigroup" time series
plot(data[, "citigroup"], main="Citigroup")

# Create vert_line to identify January 4th, 2016 in citigroup
vert_line <- which(index(data[, "citigroup"]) == as.POSIXct("2016-01-04"))

# Add a red vertical line using vert_line
abline(v = .index(data[, "citigroup"])[vert_line], col = "red")

# Create hori_line to identify average price of citigroup
hori_line <- mean(data[, "citigroup"])

# Add a blue horizontal line using hori_line
abline(h = hori_line, col = "blue")

# Create period to hold the 3 months of 2015
period <- c("2015-01/2015-03")

# Highlight the first three months of 2015 
PerformanceAnalytics::chart.TimeSeries(data[, "citigroup"], period.areas=period)

# Highlight the first three months of 2015 in light grey
PerformanceAnalytics::chart.TimeSeries(data[, "citigroup"], period.areas=period, period.color="lightgrey")

# Plot the microsoft series
plot(data[, "microsoft"], main="Dividend date and amount")

# Add the citigroup series
lines(data[, "citigroup"], col="orange", lwd=2)

# Add a new y axis for the citigroup series
axis(side=4, at=pretty(data[, "citigroup"]), col="orange")

micro_div_date <- "15 Nov. 2016"
citi_div_date <- "13 Nov. 2016"
micro_div_value <- "$0.39"
citi_div_value <- "$0.16"
# Same plot as the previous exercise
plot(data$microsoft, main = "Dividend date and amount")

lines(data$citigroup, col = "orange", lwd = 2)
axis(side = 4, at = pretty(data$citigroup), col = "orange")

# Create the two legend strings
micro <- paste0("Microsoft div. of ", micro_div_value," on ", micro_div_date)
citi <- paste0("Citigroup div. of ", citi_div_value," on ", citi_div_date)

# Create the legend in the bottom right corner
legend(x = "bottomright", legend = c(micro, citi), col = c("black", "orange"), lty = c(1, 1))

data_1_1_old <- data

Chapter 2 - Univariate Time Series

Univariate time series analysis - deals with only a single variable:

  • Location, Dispersion, Distribution - frequently presented by way of histograms
  • Time series typically need to be transformed prior to these calculations, since the raw price levels are usually not the quantity of interest
    • For example, it is often more helpful to get the distribution of price change (and/or percentage return) rather than just the stock price
  • In finance, price series are often transformed to differenced data, making it a return series
    • In R, the ROC() (which stands for “Rate of Change”) function from the TTR package does this automatically to a price or volume series x

Other visualization tools:

  • Can create histograms of stock returns
  • Can use boxplot() to see the box-and-whisker of the stock returns
    • The argument horizontal=TRUE will display the box horizontally
  • Can run acf() to see the autocorrelation of the returns
  • Can run qqnorm() and qqline() to see whether the data are normally distributed

Combining everything so far:

  • The histogram helps with understanding both central tendencies and outliers
    • The box and whiskers plot helps in a similar manner - also helps to show investment riskiness
  • The autocorrelation plot helps with understanding the linkages between today and days in the future
  • The QQ plot helps to assess whether methods/tests that rely on normality can be safely used on the dataset

Example code includes:

tmpData <- readr::read_delim("./RInputFiles/dataset_2_1.csv", delim=" ")
## Parsed with column specification:
## cols(
##   Index = col_date(format = ""),
##   Apple = col_double()
## )
names(tmpData) <- c("Index", "apple")
data <- xts::xts(tmpData[, -1], order.by=as.Date(tmpData$Index))
# indexClass(data) <- c("POSIXt", "POSIXlt")

# Plot Apple's stock price 
plot(data[, "apple"], main="Apple stock price")

# Create a time series called rtn
rtn <- TTR::ROC(data[, "apple"])

# Plot Apple daily price and daily returns 
par(mfrow=c(1, 2))
plot(data[, "apple"], main="Apple stock price")
plot(rtn)

par(mfrow=c(1, 1))


dim(rtn)
## [1] 522   1
rtn <- rtn[complete.cases(rtn), ]
dim(rtn)
## [1] 521   1
# Create a histogram of Apple stock returns
hist(rtn, main="Apple stock return distribution", probability=TRUE)

# Add a density line
lines(density(rtn[complete.cases(rtn), ]))

# Redraw a thicker, red density line
lines(density(rtn[complete.cases(rtn), ]), col="red", lwd=2)

rtnRaw <- as.double(rtn$apple)

# Draw box and whisker plot for the Apple returns
boxplot(rtnRaw)

# Draw a box and whisker plot of a normal distribution
boxplot(rnorm(1000))

# Redraw both plots on the same graphical window
par(mfrow=c(2, 1))
boxplot(rtnRaw, horizontal=TRUE)
boxplot(rnorm(1000), horizontal=TRUE)

par(mfrow=c(1, 1))


# Draw autocorrelation plot
acf(rtn, main="Apple return autocorrelation")

# Redraw with a maximum lag of 10
acf(rtn, main="Apple return autocorrelation", lag.max=10)

# Create q-q plot
qqnorm(rtn, main="Apple return QQ-plot")

# Add a red line showing normality
qqline(rtn, col="red")

par(mfrow=c(2, 2))

hist(rtn, probability=TRUE)
lines(density(rtn), col="red")
boxplot(rtnRaw)
acf(rtn)
qqnorm(rtn)
qqline(rtn, col="red")

par(mfrow=c(1, 1))

Chapter 3 - Multivariate Time Series

Dealing with higher dimensions - visualization challenges with larger numbers of series:

  • Might want to compare stock prices vs interest rate changes
  • Cannot easily visualize even 10 time series, let alone 100 time series
  • One solution is to plot both time series as barcharts. There are two types:
    • Grouped barchart: for a single period, there are as many bars as time series
    • Stacked bar chart: for each period, there is a single bar, and each time series is represented by a portion of the bar proportional to the value of the time series at this date (for portfolio weights, as in the example below, the portions at each period total 100%)

Multivariate time series:

  • To create a stacked chart, use barplot(myFrame, col=c(), main=) # can specify the desired colors in the barplot or use the defaults
  • Can create the correlation matrix using cor(myMatrix) - use round(cor(myMatrix), digits=) to control the displayed precision
    • Several types of correlations exist but the most used ones are:
    • Pearson correlation: measures the linear relationship between 2 variables
    • Spearman rank correlation: measures the statistical dependency between the ranking of 2 variables (not necessarily linear)
  • Can create the pair chart using pairs(myFrame, lower.panel=NULL, main=) # the lower.panel=NULL shows only the diagonal and the upper-right of the pairs plot
  • Can create a correlation plot using corrplot(myMatrix, method="number", type="upper") # type="upper" shows only the upper triangle, diagonal included

Higher dimension time series:

  • Can display a correlation matrix as a heat map
    • corrplot(myMatrix, method="color", type="upper")

Example code includes:

# You are provided with a dataset (portfolio) containing the weights of stocks A (stocka) and B (stockb) in your portfolio for each month in 2016
stockA <- c(0.1, 0.4, 0.5, 0.5, 0.2, 0.3, 0.7, 0.8, 0.7, 0.2, 0.1, 0.2)
stockB <- c(0.9, 0.6, 0.5, 0.5, 0.8, 0.7, 0.3, 0.2, 0.3, 0.8, 0.9, 0.8)
pDates <- as.Date(c('2016-01-01', '2016-02-01', '2016-03-01', '2016-04-01', '2016-05-01', '2016-06-01', '2016-07-01', '2016-08-01', '2016-09-01', '2016-10-01', '2016-11-01', '2016-12-01'))
portfolio <- xts(data.frame(stocka=stockA, stockb=stockB), order.by=pDates)

# Plot stacked barplot
barplot(portfolio)

# Plot grouped barplot
barplot(portfolio, beside=TRUE)

tmpData <- readr::read_delim("./RInputFiles/data_3_2.csv", delim=",")
## Parsed with column specification:
## cols(
##   Index = col_date(format = ""),
##   sp500 = col_double(),
##   citigroup = col_double(),
##   microsoft = col_double(),
##   apple = col_double(),
##   dowchemical = col_double(),
##   yahoo = col_double()
## )
my_data <- xts::xts(tmpData[, -1], order.by=as.Date(tmpData$Index))


citi <- as.numeric(my_data$citigroup)
sp500 <- as.numeric(my_data$sp500)

# Draw the scatterplot
plot(y=citi, x=sp500)

# Draw a regression line
abline(reg=lm(citi ~ sp500), col="red", lwd=2)

# my_data contains the S&P 500 plus the returns for 5 stocks: Citigroup, Microsoft, Apple, Dow Chemical, and Yahoo
# Create correlation matrix using Pearson method
cor(my_data)
##                 sp500 citigroup microsoft     apple dowchemical     yahoo
## sp500       1.0000000 0.5097953 0.3743215 0.3576966   0.5217243 0.2900962
## citigroup   0.5097953 1.0000000 0.4841408 0.4291841   0.5085190 0.4029490
## microsoft   0.3743215 0.4841408 1.0000000 0.5133469   0.3954523 0.4329388
## apple       0.3576966 0.4291841 0.5133469 1.0000000   0.3627755 0.3413626
## dowchemical 0.5217243 0.5085190 0.3954523 0.3627755   1.0000000 0.2938749
## yahoo       0.2900962 0.4029490 0.4329388 0.3413626   0.2938749 1.0000000
# Create correlation matrix using Spearman method
cor(my_data, method="spearman")
##                 sp500 citigroup microsoft     apple dowchemical     yahoo
## sp500       1.0000000 0.5192579 0.4244237 0.3518853   0.5316235 0.3262037
## citigroup   0.5192579 1.0000000 0.4976477 0.4374850   0.5607511 0.3780730
## microsoft   0.4244237 0.4976477 1.0000000 0.5128477   0.4684114 0.4448179
## apple       0.3518853 0.4374850 0.5128477 1.0000000   0.3681791 0.3680715
## dowchemical 0.5316235 0.5607511 0.4684114 0.3681791   1.0000000 0.3464743
## yahoo       0.3262037 0.3780730 0.4448179 0.3680715   0.3464743 1.0000000
# Create scatterplot matrix
pairs(as.data.frame(my_data))

# Create upper panel scatterplot matrix
pairs(as.data.frame(my_data), lower.panel=NULL)

cor_mat <- cor(my_data)

# In this exercise, you will use the provided correlation matrix cor_mat
# Create correlation matrix
corrplot::corrplot(cor_mat)

# Create correlation matrix with numbers
corrplot::corrplot(cor_mat, method="number")

# Create correlation matrix with colors
corrplot::corrplot(cor_mat, method="color")

# Create upper triangle correlation matrix
corrplot::corrplot(cor_mat, method="number", type="upper")

# Draw heatmap of cor_mat
corrplot::corrplot(cor_mat, method="color")

# Draw upper heatmap
corrplot::corrplot(cor_mat, method="color", type="upper")

# Draw the upper heatmap with hclust
corrplot::corrplot(cor_mat, method="color", type="upper", order="hclust")


Chapter 4 - Case Study: Stock Picking for Portfolios

Case study presentation:

  • Suppose you have a portfolio of Apple, Microsoft, and Yahoo
  • Suppose also that you can add just a single extra stock with some spare cash
  • Examine the correlations of new stocks to the existing portfolio
    • Starting point assumption is capital protection - low correlation to the existing portfolio
  • The PerformanceAnalytics package has some helpful tools for this analysis

New stocks:

  • Goal is to choose the best new stock for the portfolio
  • The PerformanceAnalytics package provides additional tools to get a finer view of your portfolio
    • In particular, the charts.PerformanceSummary() function provides a quick and easy way to display the portfolio value, returns and periods of poor performance, also known as drawdowns

Course conclusion:

  • xts, plot()
  • Univariate
  • Multivariate
  • Case study

Example code includes:

# In this exercise, you are provided with a dataset data containing the value and the return of the portfolio over time, in value and return, respectively.

tmpData <- readr::read_delim("./RInputFiles/data_4_1.csv", delim=",")
## Parsed with column specification:
## cols(
##   Index = col_date(format = ""),
##   value = col_double(),
##   return = col_double()
## )
data <- xts::xts(tmpData[, -1], order.by=as.Date(tmpData$Index))
# indexClass(data) <- c("POSIXt", "POSIXlt")


# Plot the portfolio value
plot(data$value, main="Portfolio Value")

# Plot the portfolio return
plot(data$return, main="Portfolio Return")

# Plot a histogram of portfolio return 
hist(data$return, probability=TRUE)

# Add a density line
lines(density(data$return), col="red", lwd=2)

tmpPortfolioData <- data


# The new dataset data containing four new stocks is available in your workspace: Goldman Sachs (GS), Coca-Cola (KO), Walt Disney (DIS), Caterpillar (CAT)

tmpData <- readr::read_delim("./RInputFiles/data_4_3.csv", delim=",")
## Parsed with column specification:
## cols(
##   Index = col_date(format = ""),
##   GS = col_double(),
##   KO = col_double(),
##   DIS = col_double(),
##   CAT = col_double()
## )
data <- xts::xts(tmpData[, -1], order.by=as.Date(tmpData$Index))
# indexClass(data) <- c("POSIXt", "POSIXlt")


# Plot the four stocks on the same graphical window
par(mfrow=c(2, 2), mex=0.8, cex=0.8)
plot(data[, 1])
plot(data[, 2])
plot(data[, 3])
plot(data[, 4])

par(mfrow=c(1, 1), mex=1, cex=1)


# In this exercise, you are provided with four individual series containing the return of the same four stocks:
# gs, ko, dis, cat
# The DataCamp solution made no sense here; the stock returns are recomputed from prices below


portfolio <- as.numeric(tmpPortfolioData$return)
gs <- as.numeric(TTR::ROC(data[, "GS"]))[-1]
ko <- as.numeric(TTR::ROC(data[, "KO"]))[-1]
dis <- as.numeric(TTR::ROC(data[, "DIS"]))[-1]
cat <- as.numeric(TTR::ROC(data[, "CAT"]))[-1]


# Draw the scatterplot of gs against the portfolio
plot(y=gs, x=portfolio)

# Add a regression line in red
abline(reg=lm(gs ~ portfolio), col="red", lwd=2)


# Plot scatterplots and regression lines to a 2x2 window
par(mfrow=c(2, 2))

plot(y=gs, x=portfolio)
abline(reg=lm(gs ~ portfolio), col="red", lwd=2)

plot(y=ko, x=portfolio)
abline(reg=lm(ko ~ portfolio), col="red", lwd=2)

plot(y=dis, x=portfolio)
abline(reg=lm(dis ~ portfolio), col="red", lwd=2)

plot(y=cat, x=portfolio)
abline(reg=lm(cat ~ portfolio), col="red", lwd=2)

par(mfrow=c(1, 1))


# In this exercise, you are given a dataset old.vs.new.portfolio with the following self-explanatory columns: old.portfolio.value, new.portfolio.value, old.portfolio.rtn, new.portfolio.rtn
tmpData <- readr::read_delim("./RInputFiles/old.vs.new.portfolio.csv", delim=",")
## Parsed with column specification:
## cols(
##   Index = col_date(format = ""),
##   old.portfolio.value = col_double(),
##   new.portfolio.value = col_double(),
##   old.portfolio.rtn = col_double(),
##   new.portfolio.rtn = col_double()
## )
old.vs.new.portfolio <- xts::xts(tmpData[, -1], order.by=as.Date(tmpData$Index))
# indexClass(data) <- c("POSIXt", "POSIXlt")


# Plot new and old portfolio values on same chart
plot(old.vs.new.portfolio$old.portfolio.value)

lines(old.vs.new.portfolio$new.portfolio.value, col = "red")

# Plot density of the new and old portfolio returns on same chart
plot(density(old.vs.new.portfolio$old.portfolio.rtn))
lines(density(old.vs.new.portfolio$new.portfolio.rtn), col ="red")

# Draw value, return, drawdowns of old portfolio
PerformanceAnalytics::charts.PerformanceSummary(old.vs.new.portfolio[, "old.portfolio.rtn"])

# Draw value, return, drawdowns of new portfolio
PerformanceAnalytics::charts.PerformanceSummary(old.vs.new.portfolio[, "new.portfolio.rtn"])

# Draw both portfolios on same chart
# Draw value, return, drawdowns of new portfolio
PerformanceAnalytics::charts.PerformanceSummary(old.vs.new.portfolio[, c("old.portfolio.rtn", "new.portfolio.rtn")])


Communicating with the Tidyverse

Chapter 1 - Custom ggplot2 themes

Introduction to the data - finding stories in datasets:

  • Communication is the final step in the tidyverse workflow
  • This course will recreate a production-level plot about hours worked in Europe, originally published by Swiss public radio, using ggplot2
  • Will also create a report of the findings using R Markdown
  • This course will work with two datasets from the ILO (International Labor Organization)
    • ilo_working_hours - country-year-working_hours
    • ilo_hourly_compensation - country-year-hourly_compensation
  • Begin by integrating the data using dplyr::inner_join
  • Usually, categorical variables like country in this example should be converted to factors before plotting them
    • You can do so using as.factor(). In your data set, two columns are still of type "character" - use mutate() to turn them into factors

Filtering and plotting the data:

  • The filter() function can be used to retain only the European countries and the key years of interest
    • The %in% operator will be valuable for this, given a vector of countries in Europe
  • Will look at histograms, scatter-plots, titling, and the like
  • Will also use group_by() and summarize() to look at tabular results

Custom ggplot2 themes - providing a custom look to a chart:

  • Custom looks can make it easier to highlight key data - colors, emphasis, shading, etc.
  • The theme() function is added to a plot just like any other element in a ggplot
    • text=element_text(family=, color=) # to make a specific family and color available for all the labels and text
  • Can also add default ggplot2 themes to a plot
  • Can chain themes, including a default theme followed by several overrides
    • theme_classic() + theme(text=element_text(family=, color=))
  • Can get an overview of all the possible options by using ?theme
  • There are four key members of the element_* function family
    • element_text()
    • element_rect()
    • element_line()
    • element_blank() - makes plot elements disappear

Example code includes:

library(ggplot2)

load("./RInputFiles/ilo_hourly_compensation.RData")
load("./RInputFiles/ilo_working_hours.RData")


# Join both data frames
ilo_data <- ilo_hourly_compensation %>%
  inner_join(ilo_working_hours, by = c("country", "year"))

# Count the resulting rows
ilo_data  %>% 
    count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1   612
# Examine ilo_data
ilo_data
## # A tibble: 612 x 4
##    country     year   hourly_compensation working_hours
##    <chr>       <chr>                <dbl>         <dbl>
##  1 Australia   1980.0               8.44           34.6
##  2 Canada      1980.0               8.87           34.8
##  3 Denmark     1980.0              10.8            31.9
##  4 Finland     1980.0               8.61           35.6
##  5 France      1980.0               8.90           35.4
##  6 Italy       1980.0               8.09           35.7
##  7 Japan       1980.0               5.46           40.8
##  8 Korea, Rep. 1980.0               0.950          55.3
##  9 Norway      1980.0              11.8            30.4
## 10 Spain       1980.0               5.86           36.8
## # ... with 602 more rows
# Turn year into a factor
ilo_data <- ilo_data %>%
  mutate(year = as.factor(as.numeric(year)))

# Turn country into a factor
ilo_data <- ilo_data %>%
  mutate(country = as.factor(country))


# Examine the European countries vector
european_countries <- c('Finland', 'France', 'Italy', 'Norway', 'Spain', 'Sweden', 'Switzerland', 'United Kingdom', 'Belgium', 'Ireland', 'Luxembourg', 'Portugal', 'Netherlands', 'Germany', 'Hungary', 'Austria', 'Czech Rep.')
european_countries
##  [1] "Finland"        "France"         "Italy"          "Norway"        
##  [5] "Spain"          "Sweden"         "Switzerland"    "United Kingdom"
##  [9] "Belgium"        "Ireland"        "Luxembourg"     "Portugal"      
## [13] "Netherlands"    "Germany"        "Hungary"        "Austria"       
## [17] "Czech Rep."
# Only retain European countries
ilo_data <- ilo_data %>%
  filter(country %in% european_countries)

# Examine the structure of ilo_data
str(ilo_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    380 obs. of  4 variables:
##  $ country            : Factor w/ 30 levels "Australia","Austria",..: 8 9 15 22 25 27 28 29 8 9 ...
##  $ year               : Factor w/ 27 levels "1980","1981",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ hourly_compensation: num  8.61 8.9 8.09 11.8 5.86 ...
##  $ working_hours      : num  35.6 35.4 35.7 30.4 36.8 ...
# Group and summarize the data
ilo_data %>%
  group_by(year) %>%
  summarize(mean_hourly_compensation = mean(hourly_compensation),
            mean_working_hours = mean(working_hours))
## # A tibble: 27 x 3
##    year  mean_hourly_compensation mean_working_hours
##    <fct>                    <dbl>              <dbl>
##  1 1980                      9.27               34.0
##  2 1981                      8.69               33.6
##  3 1982                      8.36               33.5
##  4 1983                      7.81               33.9
##  5 1984                      7.54               33.7
##  6 1985                      7.79               33.7
##  7 1986                      9.70               34.0
##  8 1987                     12.1                33.6
##  9 1988                     13.2                33.7
## 10 1989                     13.1                33.5
## # ... with 17 more rows
# Filter for 2006
plot_data <- ilo_data %>%
  filter(year == 2006)
  
# Create the scatter plot
ggplot(plot_data) +
  geom_point(aes(x = working_hours, y = hourly_compensation))

# Create the plot
ggplot(plot_data) +
  geom_point(aes(x = working_hours, y = hourly_compensation)) +
  # Add labels
  labs(
    x = "Working hours per week",
    y = "Hourly compensation",
    title = "The more people work, the less compensation they seem to receive",
    subtitle = "Working hours and hourly compensation in European countries, 2006",
    caption = "Data source: ILO, 2017"
  )

# Save your current plot into a variable: ilo_plot
ilo_plot <- ggplot(plot_data) +
  geom_point(aes(x = working_hours, y = hourly_compensation)) +
  labs(
    x = "Working hours per week",
    y = "Hourly compensation",
    title = "The more people work, the less compensation they seem to receive",
    subtitle = "Working hours and hourly compensation in European countries, 2006",
    caption = "Data source: ILO, 2017"
  )
  
# Try out theme_minimal
ilo_plot +
  theme_minimal()

# Try out any other possible theme function
ilo_plot +
  theme_linedraw()

windowsFonts(Bookman=windowsFont("Bookman Old Style"))

ilo_plot <- ilo_plot +
  theme_minimal() +
  # Customize the "minimal" theme with another custom "theme" call
  theme(
    text = element_text(family = "Bookman"),
    title = element_text(color = "gray25"),
    plot.subtitle = element_text(size=12),
    plot.caption = element_text(color = "gray30")
  )

# Render the plot object
ilo_plot

ilo_plot +
  # "theme" calls can be stacked upon each other, so this is already the third call of "theme"
  theme(
    plot.background = element_rect(fill = "gray95"),
    plot.margin = unit(c(5, 10, 5, 10), units = "mm")
  )


Chapter 2 - Creating Custom and Unique Visualization

Visualizing aspects of data with facets:

  • The facet_grid() function builds on the facet_wrap() concept, allowing for further control
    • facet_grid(rowVar ~ colVar)
    • Note that facet_grid(. ~ year) will give the same output as facet_wrap(~ year)
  • Theme options are available for faceted plots - strip.background, strip.text, etc.
  • Can also create your own theme functions, such as:
    • theme_green <- function(){ theme( plot.background = element_rect(fill = "green"), panel.background = element_rect(fill = "lightgreen") ) }

Custom plot to emphasize change:

  • The dot plot is useful for comparing change over time
    • Dot for starting point, arrow pointing towards ending point, text labels at start and end of arrow, arranged so that country (or whatever) is along the y-axis
  • The default geom_dotplot() is NOT what is needed - this is a histogram using dots rather than bars
  • Instead, the geom_path() is available for connecting observations in the order in which they appear in the data (so, proper ordering of the data frame is VERY important!)
    • The geom_path(aes(x=, y=), arrow=arrow()) will expect at least one numeric variable, and one variable (y) that is either numeric or factor
    • The arrow() is a function that allows for calling a specific type of arrow, arrow head, and the like

Polishing the dot plot:

  • Ordering the factors can help make things much clearer in the ggplot - ggplot2 arranges axis categories according to the underlying factor levels
  • The forcats package is great for working with factor variables, and is part of the tidyverse
    • fct_drop for dropping levels
    • fct_rev for reversing factor levels
    • fct_reorder for reordering factor levels
  • The arguments for fct_reorder(factorVar, dataVar, FUN) - frequently applied by way of a mutate() call
  • Can further use the hjust and vjust aesthetics to nudge the labels for better readability
    • These are added inside the aes() call for geom_text() and can be like aes(…, hjust=ifelse(year == 2006, 1.4, -0.4))
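Of the three forcats functions above, only fct_reorder() appears in the example code below; a quick sketch of fct_drop() and fct_rev(), using a small hypothetical factor rather than the course data:

```r
library(forcats)

# A factor with an unused level ("Spain" has no observations)
f <- factor(c("France", "Italy", "France"),
            levels = c("France", "Italy", "Spain"))

# fct_drop() removes levels that have no observations
f_dropped <- fct_drop(f)
levels(f_dropped)   # "France" "Italy"

# fct_rev() reverses the order of the levels (handy for flipping a ggplot axis)
f_reversed <- fct_rev(f)
levels(f_reversed)  # "Spain" "Italy" "France"
```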

Finalizing plots for different audiences and devices:

  • Changing the viewport (zooming or repositioning) can be managed in either of two ways
    • coord_cartesian(xlim=c(), ylim=c()) is the standard ggplot2 mechanism
    • The difference from using xlim() + ylim() directly is that coord_cartesian() merely zooms the view, while xlim()/ylim() drop the data outside the limits entirely - zooming is generally preferred
  • Need to customize the plot for mobile devices
    • Can be helpful to have the plot available in 16:9 aspect ratio, which nicely fits most smartphones
    • Can also be helpful to remove the axes, and put any labels needed directly into the plot
  • Individual geometries like geom_text() can be given their own data sets, so they do not use the data set given to the initial ggplot() call
    • This is needed here to add only one label to each arrow
    • With the original data set ilo_data, two labels would be added per country, because there are two observations for each country, one for 1996 and one for 2006

Example code includes:

# Filter ilo_data to retain the years 1996 and 2006
ilo_data <- ilo_data %>%
  filter(year == 1996 | year == 2006)


# Again, you save the plot object into a variable so you can save typing later on
ilo_plot <- ggplot(ilo_data, aes(x = working_hours, y = hourly_compensation)) +
  geom_point() +
   labs(
    x = "Working hours per week",
    y = "Hourly compensation",
    title = "The more people work, the less compensation they seem to receive",
    subtitle = "Working hours and hourly compensation in European countries, 2006",
    caption = "Data source: ILO, 2017"
  ) +
  # Add facets here
  facet_grid(facets = . ~ year)
 
ilo_plot

# For a starter, let's look at what you did before: adding various theme calls to your plot object
ilo_plot +
  theme_minimal() +
  theme(
    text = element_text(family = "Bookman", color = "gray25"),
    plot.subtitle = element_text(size = 12),
    plot.caption = element_text(color = "gray30"),
    plot.background = element_rect(fill = "gray95"),
    plot.margin = unit(c(5, 10, 5, 10), units = "mm")
  )

# Define your own theme function below
theme_ilo <- function() {
  theme_minimal() +
  theme(
    text = element_text(family = "Bookman", color = "gray25"),
    plot.subtitle = element_text(size = 12),
    plot.caption = element_text(color = "gray30"),
    plot.background = element_rect(fill = "gray95"),
    plot.margin = unit(c(5, 10, 5, 10), units = "mm"))
}


# Apply your theme function
ilo_plot <- ilo_plot + theme_ilo()

# Examine ilo_plot
ilo_plot

ilo_plot +
  # Add another theme call
  theme(
    # Change the background fill to make it a bit darker
    strip.background = element_rect(fill = "gray60", color = "gray95"),
    # Make text a bit bigger and change its color to white
    strip.text = element_text(size = 11, color = "white")
  )

# Create the dot plot
ggplot(ilo_data) +
    geom_path(aes(x=working_hours, y=country))

ggplot(ilo_data) +
  geom_path(aes(x = working_hours, y = country),
  # Add an arrow to each path
            arrow = arrow(length = unit(1.5, "mm"), type = "closed"))

ggplot(ilo_data) +
  geom_path(aes(x = working_hours, y = country),
            arrow = arrow(length = unit(1.5, "mm"), type = "closed")) +
  # Add a geom_text() geometry
  geom_text(
          aes(x = working_hours,
              y = country,
              label = round(working_hours, 1))
        )

library(forcats)

# Reorder country factor levels
ilo_data <- ilo_data %>%
  # Arrange data frame
  arrange(country, year) %>%
  # Reorder countries by working hours in 2006
  mutate(country = fct_reorder(country,
                               working_hours,
                               last))

# Plot again
ggplot(ilo_data) +
  geom_path(aes(x = working_hours, y = country),
            arrow = arrow(length = unit(1.5, "mm"), type = "closed")) +
    geom_text(
          aes(x = working_hours,
              y = country,
              label = round(working_hours, 1))
          )

# Save plot into an object for reuse
ilo_dot_plot <- ggplot(ilo_data) +
  geom_path(aes(x = working_hours, y = country),
            arrow = arrow(length = unit(1.5, "mm"), type = "closed")) +
    # Specify the hjust aesthetic with a conditional value
    geom_text(
          aes(x = working_hours,
              y = country,
              label = round(working_hours, 1),
              hjust = ifelse(year == "2006", 1.4, -0.4)
            ),
          # Change the appearance of the text
          size = 3,
          family = "Bookman",
          color = "gray25"
          )

ilo_dot_plot

# Reuse ilo_dot_plot
ilo_dot_plot <- ilo_dot_plot +
  # Add labels to the plot
  labs(
    x = "Working hours per week",
    y = "Country",
    title = "People work less in 2006 compared to 1996",
    subtitle = "Working hours in European countries, development since 1996",
    caption = "Data source: ILO, 2017"
  ) +
  # Apply your theme
  theme_ilo() +
  # Change the viewport
  coord_cartesian(xlim = c(25, 41))
  
# View the plot
ilo_dot_plot

# Compute temporary data set for optimal label placement
median_working_hours <- ilo_data %>%
  group_by(country) %>%
  summarize(median_working_hours_per_country = median(working_hours)) %>%
  ungroup()

# Have a look at the structure of this data set
str(median_working_hours)
## Classes 'tbl_df', 'tbl' and 'data.frame':    17 obs. of  2 variables:
##  $ country                         : Factor w/ 30 levels "Netherlands",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ median_working_hours_per_country: num  27 27.8 28.4 31 30.9 ...
ilo_dot_plot +
  # Add label for country
  geom_text(data = median_working_hours,
            aes(y = country,
                x = median_working_hours_per_country,
                label = country),
            vjust = -0.5,
            size=3,
            family = "Bookman",
            color = "gray25") +
  # Remove axes and grids
  theme(
    axis.ticks = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    panel.grid = element_blank(),
    # Also, let's reduce the font size of the subtitle
    plot.subtitle = element_text(size = 9)
  )


Chapter 3 - Introduction to R Markdown

What is R Markdown?

  • R Markdown is a framework for converting R code into a wide range of outputs - html, PDF, etc.
    • R Markdown -> knitr -> md -> pandoc -> final output
  • The biggest benefit of R Markdown is the full reproducibility of the analysis
    • Other people or yourself (if there is new data)
  • The code needs to be executable on other people’s machines, and the data should either be included (or have a link to it)
    • These are the minimum standards for reproducibility
    • Fuller standards would include software/package systems, run date/time, operating system, etc.

Formatting with R Markdown:

  • Markdown is an example of a mark-up language (like html, which is a hyper-text mark-up language)
    • Markdown was created to have a quick means of creating valid html code that could be published on the internet
  • The hash signs (#) are for levels of header - # (first), ## (second), etc.
  • The asterisk makes italics with singles (*myItalicText*) and bold with doubles (**myBoldText**)
  • Links are written with the link name inside square brackets followed by the target link inside parentheses - [myLinkName](myTargetLink)
  • The R Markdown document is a mix of R code and Markdown code
    • R Markdown uses pandoc markdown, part of the markdown family
    • Can use the Pandoc Markdown reference guide, available from RStudio
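A minimal Markdown snippet illustrating these rules (the names are placeholders):

```markdown
# First-level header
## Second-level header

*myItalicText* and **myBoldText**

[myLinkName](https://www.example.com)
```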

R Code in R Markdown Documents:

  • Adding R chunks is as simple as adding triple back-ticks, followed by {r}, ended with triple back-ticks
  • Can also add R code inline by wrapping it in back-ticks with a leading r - for example, back-tick r 2 + 2 back-tick renders as 4
  • There are many options available for R code chunks including
    • include=FALSE # execute the R code but do not quote it or print any output
    • echo=FALSE # do not show the R code, but show its output
    • message=FALSE # do not show messages
    • warning=FALSE # do not show warnings
    • eval=FALSE # do not evaluate the R chunk (but do print it provided the default echo=TRUE is set)
  • R code chunks can also be named
    • This makes sense in large documents, especially if there is an error inside a chunk
    • While knitting a document in RStudio, for example, the error can then be located in an easier fashion
    • Chunk options are added after the name of the chunk and a comma, like so: {r name, option=value}
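A named chunk with options might look like the following (the chunk name summary-stats is hypothetical; cars is a built-in dataset):

````markdown
```{r summary-stats, echo=FALSE, message=FALSE}
summary(cars$speed)
```
````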

Images in R Markdown Files:

  • Images resulting from code are responsive, which is to say that they will change with the page size
    • Sometimes, the default options that go with a figure are sub-optimal (wrong aspect ratio or the like)
    • Can add options like fig.height=6 inside the ```{r} command to address these - default unit is inches
    • Also can use fig.width= (inches) and fig.align= (any of "right", "left", "center")
  • Can also load external images into Markdown
    • The syntax is an exclamation mark, then the image name in square brackets, then the image location in parentheses - ![myImageName](myImageLocation)

Example code is contained in the summary Excel worksheet.


Chapter 4 - Customizing R Markdown Reports

Advanced YAML Settings (YAML is a recursive acronym for "YAML Ain't Markup Language"):

  • YAML documents typically start and end with three hyphens (---) and contain key: value pairs
    • Indentation indicates sub-family relationships; the amount of spacing does not matter, but everything at the same level must be indented the same
  • All R Markdown documents begin with a YAML header, which can be customized and enhanced
  • You add a table of contents with toc: true and specify whether it should be floating (= whether it moves along as you scroll) with toc_float
    • The depth of your TOC is set with toc_depth
  • Before you dig deeper into custom stylesheets, let’s enable code folding with code_folding: ….
    • This will allow your readers to completely hide all code chunks or show them – all at once or individually
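A sketch of a YAML header combining these options (the title is a placeholder; code_folding accepts hide or show):

```yaml
---
title: "My Report"
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: hide
---
```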

Custom stylesheets - creating a unique theme for a report:

  • Can refer to any CSS (cascading style sheet) in the YAML header
  • Can use any of the html tags that would be created by the document, and enhance the properties they will have for this html document output
    • h2 { font-family: "Bookman", serif; }
    • Conclude each rule with a semicolon!
    • body, h1, h2 { font-family: "Bookman", serif; }
    • Separate multiple html tags with commas
    • a { color: #0000FF; font-weight: bold; }
    • Separate the declarations with semicolons (the same character used to end them)
  • There are some further customization possibilities
    • strong { color: blue; } # this will make everything with the strong tag blue (note that CSS color keywords are not quoted)
    • strong.red { color: red; } # a class selector - elements written as <strong class="red"> will be red even while the rest stay blue
  • It is also possible to specify combinators in CSS, so that only tags nested within another tag are affected
    • div strong { color: green; } # strong tags nested at any depth inside div tags will be colored green
    • div > strong { color: red; } # strong tags that are direct children (pure child-parent relationship) of a div will be colored red
  • Mozilla Developer Network has a lot of style tag ideas
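Collected into a stylesheet, the rules above might look like this (the filename custom.css, referenced from the YAML header with css: custom.css, is hypothetical):

```css
/* custom.css - referenced from the YAML header via css: custom.css */
body, h1, h2 { font-family: "Bookman", serif; }
a            { color: #0000FF; font-weight: bold; }
strong       { color: blue; }
strong.red   { color: red; }    /* only <strong class="red"> elements */
div strong   { color: green; }  /* strong nested at any depth inside div */
div > strong { color: red; }    /* strong that is a direct child of div */
```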

Beautiful tables:

  • By default, R Markdown renders tables exactly as they would be rendered to the R console
  • Can add the df_print: key-value under the html_document: (or whatever) area
    • Typically the value is either df_print: kable or df_print: paged
  • Alternately, to just change a single table, pipe the output to knitr::kable()
    • myData %>% group_by(myFactor) %>% summarize(myEquations) %>% knitr::kable()
  • Tables can also be styled using html tags - basic anatomy of a table includes
    • A <table> contains a header (<thead>) and a body (<tbody>)
    • Each of the header and the body will have one or more rows, each wrapped in <tr> … </tr>
    • Each row of the header is <th>Column1</th> <th>Column2</th> . . .
    • Each row of the body is <td>Cell1</td> <td>Cell2</td> . . .
  • Add %>% pull(n) (from dplyr) to the inline R statement in the “Data” section, so its output is not rendered as a table
    • pull() extracts single columns out of data frames
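As a sketch of that kable() pipeline (assuming the dplyr and knitr packages are installed; the grouping by cyl in the built-in mtcars data is just illustrative):

```r
library(dplyr)
library(knitr)

# Summarize mtcars by cylinder count, then render as a Markdown table
mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg)) %>%
  kable()
```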

Summary:

  • Course summarized the final component of the tidyverse process - communication is key!
  • Switzerland demographic map
  • Can show population density using geom_line()

Example code is contained in the summary Excel worksheet.


Foundations of Probability in R

Chapter 1 - Binomial Distribution

Flipping coins in R - for example, rbinom(1, 1, 0.5) - 1 draw of 1 coin with a 50% chance of being heads:

  • Generally, a result of 1 is interpreted as heads
  • rbinom(nDraws, nPerDraw, pPerDraw) - can generate multiple simulations at the same time
  • Frequent focus in this course will be on biased coins - pPerDraw != 0.5

Density and cumulative density:

  • Histogram on a simulation can be a helpful way for understanding densities and likelihoods
  • With a known distribution, can get the exact answer using dbinom(nHit, nDraw, pPerDraw)
  • The “cumulative density” is the probability of getting this value or less
    • pbinom(nHit, nDraw, pPerDraw) # gives the cumulative probability of nHit or fewer hits when making nDraw draws each at probability pPerDraw

Expected value and variance:

  • Two interesting characteristics are the expected value and the variance of the distribution
  • The theoretical mean for the binomial is easy to calculate from the parameters
    • mean = n * p
  • The theoretical variance (mean-squared distance from the mean) for the binomial is also easy to calculate from the parameters
    • variance = n * p * (1 - p)

Example code includes:

# Generate 10 separate random flips with probability .3
rbinom(10, 1, 0.3)
##  [1] 0 1 0 0 1 1 0 0 1 0
# Generate 100 occurrences of flipping 10 coins, each with 30% probability
rbinom(100, 10, 0.3)
##   [1] 2 1 6 4 2 3 3 6 8 5 1 1 3 7 1 5 4 6 4 3 4 2 4 2 4 1 2 5 1 7 2 5 2 5 3
##  [36] 4 5 2 3 3 0 4 3 3 5 2 4 1 2 3 2 1 4 5 4 0 5 6 5 2 1 2 3 2 2 4 2 5 3 5
##  [71] 3 4 1 2 4 1 3 2 6 3 4 2 4 6 6 2 2 2 4 6 4 4 2 1 4 3 0 4 3 3
# Calculate the probability that 2 are heads using dbinom
dbinom(2, 10, 0.3)
## [1] 0.2334744
# Confirm your answer with a simulation using rbinom
mean(rbinom(10000, 10, 0.3) == 2)
## [1] 0.2353
# Calculate the probability that at least five coins are heads
1 - pbinom(4, 10, 0.3)
## [1] 0.1502683
# Confirm your answer with a simulation of 10,000 trials
mean(rbinom(10000, 10, 0.3) >= 5)
## [1] 0.1533
# Here is how you computed the answer in the last problem
mean(rbinom(10000, 10, .3) >= 5)
## [1] 0.149
# Try now with 100, 1000, 10,000, and 100,000 trials
mean(rbinom(100, 10, .3) >= 5)
## [1] 0.16
mean(rbinom(1000, 10, .3) >= 5)
## [1] 0.158
mean(rbinom(10000, 10, .3) >= 5)
## [1] 0.1518
mean(rbinom(100000, 10, .3) >= 5)
## [1] 0.15187
# Calculate the expected value using the exact formula
25 * 0.3
## [1] 7.5
# Confirm with a simulation using rbinom
mean(rbinom(10000, 25, 0.3))
## [1] 7.4447
# Calculate the variance using the exact formula
25 * 0.3 * (1 - 0.3)
## [1] 5.25
# Confirm with a simulation using rbinom
var(rbinom(10000, 25, 0.3))
## [1] 5.15845

Chapter 2 - Laws of Probability

Probability of Event A and Event B:

  • Suppose there are two independent events, possibly with different probabilities, A and B
    • P(A and B) = P(A) * P(B) # assuming A and B are independent, as assumed throughout this chapter
  • If there are two boolean vectors, A and B, then A & B will give a single boolean vector that is the “and” on each pair of elements

Probability of A or B:

  • P(A or B) = P(A) + P(B) - P(A and B)
    • Alternately, P(A or B) = 1 - P(notA and notB)
    • Can also use the mean(A | B) assuming that A and B are boolean vectors of the same length

Multiplying random variables:

  • Suppose that you already have a variable X with a known mean and variance
    • mean(a * X) = a * mean(X)
    • var(a * X) = a^2 * var(X)

Adding random variables:

  • Suppose that you already have random variables X and Y with known means and variances
    • mean(X + Y) = mean(X) + mean(Y) # does not require independence
    • var(X + Y) = var(X) + var(Y) # requires independence

Example code includes:

# Simulate 100,000 flips of a coin with a 40% chance of heads
A <- rbinom(100000, 1, 0.4)

# Simulate 100,000 flips of a coin with a 20% chance of heads
B <- rbinom(100000, 1, 0.2)

# Estimate the probability both A and B are heads
mean(A & B)
## [1] 0.0805
# You've already simulated 100,000 flips of coins A and B
A <- rbinom(100000, 1, .4)
B <- rbinom(100000, 1, .2)

# Simulate 100,000 flips of coin C (70% chance of heads)
C <- rbinom(100000, 1, .7)

# Estimate the probability A, B, and C are all heads
mean(A & B & C)
## [1] 0.05589
# Simulate 100,000 flips of a coin with a 60% chance of heads
A <- rbinom(100000, 1, 0.6)

# Simulate 100,000 flips of a coin with a 10% chance of heads
B <- rbinom(100000, 1, 0.1)

# Estimate the probability either A or B is heads
mean(A | B)
## [1] 0.63736
# Use rbinom to simulate 100,000 draws from each of X and Y
X <- rbinom(100000, 10, 0.6)
Y <- rbinom(100000, 10, 0.7)

# Estimate the probability either X or Y is <= to 4
mean((X <= 4) | (Y <= 4))
## [1] 0.20613
# Use pbinom to calculate the probabilities separately
prob_X_less <- pbinom(4, 10, 0.6)
prob_Y_less <- pbinom(4, 10, 0.7)

# Combine these to calculate the exact probability either <= 4
prob_X_less + prob_Y_less - prob_X_less * prob_Y_less
## [1] 0.2057164
# Simulate 100,000 draws of a binomial with size 20 and p = .1
X <- rbinom(100000, 20, 0.1)

# Estimate the expected value of X
mean(X)
## [1] 1.9991
# Estimate the expected value of 5 * X
mean(5 * X)
## [1] 9.9955
# Estimate the variance of X
var(X)
## [1] 1.786197
# Estimate the variance of 5 * X
var(5 * X)
## [1] 44.65493
# Simulate 100,000 draws of X (size 20, p = .3) and Y (size 40, p = .1)
X <- rbinom(100000, 20, 0.3)
Y <- rbinom(100000, 40, 0.1)

# Estimate the expected value of X + Y
mean(X + Y)
## [1] 9.99048
# Find the variance of X + Y
var(X + Y)
## [1] 7.798627
# Find the variance of 3 * X + Y
var(3 * X + Y)
## [1] 41.20331

Chapter 3 - Bayesian Statistics

Updating with evidence:

  • Probability of A given B -> P(A | B) = P(A and B) / P(B)

Prior probability - may not be equal odds prior to seeing any evidence:

  • The prior probability is the belief in the probabilities prior to seeing any evidence
  • Can just simulate the relative sizes - for example, if there is a 9:1 prior, simulate 90,000 vs. 10,000 before finding conditional probability

Bayes theorem:

  • Basically multiply prior probability for A with likelihood of seeing event (density) if A
    • Repeat for B, C, …
    • Scale multiplied probabilities to add to one, and those are the posterior probabilities
    • Pr(A|B) = P(A and B) / P(B)
  • The more generalized Bayes theory is
    • Numer = P(B|A) * P(A)
    • Denom = P(B|A) * P(A) + P(B | notA) * P(notA)
    • P(A|B) = Numer / Denom

Example code includes:

# Simulate 50000 cases of flipping 20 coins from fair and from biased
fair <- rbinom(50000, 20, 0.5)
biased <- rbinom(50000, 20, 0.75)

# How many fair cases, and how many biased, led to exactly 11 heads?
fair_11 <- sum(fair == 11)
biased_11 <- sum(biased == 11)

# Find the fraction of fair coins that are 11 out of all coins that were 11
fair_11 / (fair_11 + biased_11)
## [1] 0.8487457
# How many fair cases, and how many biased, led to exactly 16 heads?
fair_16 <- sum(fair == 16)
biased_16 <- sum(biased == 16)

# Find the fraction of fair coins that are 16 out of all coins that were 16
fair_16 / (fair_16 + biased_16)
## [1] 0.02418033
# Simulate 8000 cases of flipping a fair coin, and 2000 of a biased coin
fair_flips <- rbinom(8000, 20, 0.5)
biased_flips <- rbinom(2000, 20, 0.75)

# Find the number of cases from each coin that resulted in 14/20
fair_14 <- sum(fair_flips == 14)
biased_14 <- sum(biased_flips == 14)

# Use these to estimate the posterior probability
fair_14 / (fair_14 + biased_14)
## [1] 0.4651515
# Simulate 80,000 draws from fair coin, 10,000 from each of high and low coins
flips_fair <- rbinom(80000, 20, 0.5)
flips_high <- rbinom(10000, 20, 0.75)
flips_low <- rbinom(10000, 20, 0.25)

# Compute the number of coins that resulted in 14 heads from each of these piles
fair_14 <- sum(flips_fair == 14)
high_14 <- sum(flips_high == 14)
low_14 <- sum(flips_low == 14)

# Compute the posterior probability that the coin was fair
fair_14 / (fair_14 + high_14 + low_14)
## [1] 0.6370197
# Use dbinom to calculate the probability of 11/20 heads with fair or biased coin
probability_fair <- dbinom(11, 20, 0.5)
probability_biased <- dbinom(11, 20, 0.75)

# Calculate the posterior probability that the coin is fair
probability_fair / (probability_fair + probability_biased)
## [1] 0.8554755
# Find the probability that a coin resulting in 14/20 is fair
probability_fair <- dbinom(14, 20, .5)
probability_biased <- dbinom(14, 20, .75)
probability_fair / (probability_fair + probability_biased)
## [1] 0.179811
# Find the probability that a coin resulting in 18/20 is fair
probability_fair <- dbinom(18, 20, .5)
probability_biased <- dbinom(18, 20, .75)
probability_fair / (probability_fair + probability_biased)
## [1] 0.002699252
# Use dbinom to find the probability of 16/20 from a fair or biased coin
probability_16_fair <- dbinom(16, 20, 0.5)
probability_16_biased <- dbinom(16, 20, 0.75)

# Use Bayes' theorem to find the posterior probability that the coin is fair
(probability_16_fair * 0.99) / (probability_16_fair * 0.99 + probability_16_biased * 0.01)
## [1] 0.7068775

Chapter 4 - Related Distributions

Normal distribution - symmetrical bell curve, Gaussian:

  • The normal distribution can be defined by mean and standard deviation (or mean and variance)
  • Can simulate from the normal distribution with rnorm(n, mean, sd)

Poisson distribution - approximates the binomial under the assumption of a large number of trials each with a low probability:

  • The Poisson distribution is described only by its mean, lambda
    • Basically, lambda is nDraw * pPerDraw
    • The variance of the Poisson distribution is equal to the mean
  • The Poisson distribution is best for modeling rare events where you really just care about counts (not proportions of a total potential universe)
  • One of the useful properties of the Poisson distribution is that when you add multiple Poisson distributions together, the result is also a Poisson distribution

Geometric distribution - example of flipping a coin with probability p and assessing when the first success occurs:

  • The replicate() function is basically a wrapper to sapply() and can be helpful for simulations like this
    • replicate(10, which(rbinom(100, 1, 0.1) == 1)[1])
  • The rgeom(nDraws, prob) will give back the geometric distribution
    • The mean will be 1/prob - 1 since it is the number of trials “before” the first success
    • The mean would be 1/prob if instead the question is the number of trials to get the first success
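A quick simulation check of this relationship (the seed is chosen arbitrarily for reproducibility):

```r
# The mean of rgeom(n, prob) is 1/prob - 1, i.e., the expected
# number of failures *before* the first success
set.seed(42)
p <- 0.2
draws <- rgeom(100000, p)
mean(draws)  # close to 1/0.2 - 1 = 4
```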

Example code includes:

compare_histograms <- function(variable1, variable2) {
  x <- data.frame(value = variable1, variable = "Variable 1")
  y <- data.frame(value = variable2, variable = "Variable 2")
  ggplot(rbind(x, y), aes(value)) +
    geom_histogram() +
    facet_wrap(~ variable, nrow = 2)
}


# Draw a random sample of 100,000 from the Binomial(1000, .2) distribution
binom_sample <- rbinom(100000, 1000, 0.2)

# Draw a random sample of 100,000 from the normal approximation
normal_sample <- rnorm(100000, 200, sqrt(160))

# Compare the two distributions with the compare_histograms function
compare_histograms(binom_sample, normal_sample)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Use binom_sample to estimate the probability of <= 190 heads
mean(binom_sample <= 190)
## [1] 0.2292
# Use normal_sample to estimate the probability of <= 190 heads
mean(normal_sample <= 190)
## [1] 0.2152
# Calculate the probability of <= 190 heads with pbinom
pbinom(190, 1000, 0.2)
## [1] 0.2273564
# Calculate the probability of <= 190 heads with pnorm
pnorm(190, 200, sqrt(160))
## [1] 0.2145977
# Draw a random sample of 100,000 from the Binomial(10, .2) distribution
binom_sample <- rbinom(100000, 10, 0.2)

# Draw a random sample of 100,000 from the normal approximation
normal_sample <- rnorm(100000, 2, sqrt(1.6))

# Compare the two distributions with the compare_histograms function
compare_histograms(binom_sample, normal_sample)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Draw a random sample of 100,000 from the Binomial(1000, .002) distribution
binom_sample <- rbinom(100000, 1000, 0.002)

# Draw a random sample of 100,000 from the Poisson approximation
poisson_sample <- rpois(100000, 2)

# Compare the two distributions with the compare_histograms function
compare_histograms(binom_sample, poisson_sample)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Find the percentage of simulated values that are 0
mean(poisson_sample == 0)
## [1] 0.13513
# Use dpois to find the exact probability that a draw is 0
dpois(0, 2)
## [1] 0.1353353
# Simulate 100,000 draws from Poisson(1)
X <- rpois(100000, 1)

# Simulate 100,000 draws from Poisson(2)
Y <- rpois(100000, 2)

# Add X and Y together to create Z
Z <- X + Y

# Use compare_histograms to compare Z to the Poisson(3)
compare_histograms(Z, rpois(100000, 3))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Simulate 100 instances of flipping a 20% coin
flips <- rbinom(100, 1, 0.2)

# Use which to find the first case of 1 ("heads")
which(flips == 1)[1]
## [1] 6
# Existing code for finding the first instance of heads
which(rbinom(100, 1, .2) == 1)[1]
## [1] 5
# Replicate this 100,000 times using replicate()
replications <- replicate(100000, which(rbinom(100, 1, .2) == 1)[1])

# Histogram the replications with qplot
qplot(replications)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Generate 100,000 draws from the corresponding geometric distribution
geom_sample <- rgeom(100000, 0.2)

# Compare the two distributions with compare_histograms
compare_histograms(replications, geom_sample)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Find the probability the machine breaks on 5th day or earlier
pgeom(4, 0.1)
## [1] 0.40951
# Find the probability the machine is still working on 20th day
1 - pgeom(19, 0.1)
## [1] 0.1215767
# Calculate the probability of machine working on day 1-30
still_working <- 1 - pgeom(0:29, 0.1)

# Plot the probability for days 1 to 30
qplot(1:30, still_working)


Inference for Numerical Data

Chapter 1 - Bootstrapping for Parameter Estimates

Introduction - beginning with bootstrapping approach:

  • Example of 20 random apartment rents available in Manhattan
    • Median is the best statistic
  • Bootstrap comes from the phrase “pulling yourself up by the bootstraps” (doing the impossible without any help)
    • Take many random samples with replacement of the same length as the sample data, take their medians, and find the summary statistics about the median
    • The bootstrap distribution approximates the sampling distribution, as if we had drawn multiple samples from the population
  • Can run bootstraps from the infer package, for example
    • myData %>% infer::specify(response=) %>% infer::generate(reps=, type="bootstrap") %>% infer::calculate(stat="")

Percentile and standard error methods:

  • Sampling with replacement allows each item in the sample to potentially appear in a resample multiple times
  • Can describe a bootstrap statistic using a CI, such as the 95th percentile
  • A more accurate calculation is typically to use the standard error approach
    • sample statistic +/- t(df=n-1) * SEboot
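A base-R sketch of the standard error method, using simulated stand-in data rather than the course's rent sample (the seed and the exponential stand-in are arbitrary):

```r
# Bootstrap the median and build a t-interval around the observed statistic
set.seed(1)
x <- rexp(20, rate = 1 / 2500)                # hypothetical sample of 20 "rents"
boot_medians <- replicate(2000, median(sample(x, replace = TRUE)))
se_boot <- sd(boot_medians)                   # bootstrap standard error
t_star  <- qt(0.975, df = length(x) - 1)      # t critical value, df = n - 1
median(x) + c(-1, 1) * t_star * se_boot       # sample statistic +/- t * SE_boot
```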

Re-centering bootstrap distributions for hypothesis testing:

  • Simulation methods to test whether a bootstrap parameter is less than, different than, or greater than a critical value
  • There is a multi-step process that includes
    • The bootstrap distribution is centered around the observed sample statistic to begin with
    • Since we are now assuming Ho to be true, we shift the bootstrap distribution right/left as needed so that it is centered at the null value
    • The p-value is then the number of observations that are at least as favorable to the alternate hypothesis as the observed sample statistic
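The shifting can be done by hand in base R - a sketch with hypothetical data (Ho: mu = 10 against Ha: mu < 10; the seed and sample are arbitrary):

```r
# Hand-rolled version of the shift-and-count procedure
set.seed(2)
x <- rnorm(30, mean = 9.5)                        # observed sample
boot_means <- replicate(5000, mean(sample(x, replace = TRUE)))
null_dist  <- boot_means - mean(boot_means) + 10  # re-center at the null value
p_val <- mean(null_dist <= mean(x))               # fraction at least as favorable to Ha
```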

Example code includes:

manhattan <- readr::read_csv("./RInputFiles/manhattan.csv")
## Parsed with column specification:
## cols(
##   rent = col_integer()
## )
# Will need to either call library(infer) or add infer:: to this code
library(infer)

# Generate bootstrap distribution of medians
rent_ci_med <- manhattan %>%
  # Specify the variable of interest
  specify(response = rent) %>%  
  # Generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>% 
  # Calculate the median of each bootstrap sample
  calculate(stat = "median")

# View the structure of rent_ci_med
str(rent_ci_med)
## Classes 'tbl_df', 'tbl' and 'data.frame':    15000 obs. of  2 variables:
##  $ replicate: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ stat     : num  2422 2350 2262 2325 2350 ...
##  - attr(*, "response")= symbol rent
# Plot a histogram of rent_ci_med
ggplot(rent_ci_med, aes(x=stat)) +
  geom_histogram(binwidth=50)

# Percentile method
rent_ci_med %>%
  summarize(l = quantile(stat, 0.025),
            u = quantile(stat, 0.975))
## # A tibble: 1 x 2
##       l     u
##   <dbl> <dbl>
## 1  2162  2875
# Standard error method

# Calculate observed median
rent_med_obs <- manhattan %>%
  # Calculate observed median rent
  summarize(median(rent)) %>%     
  # Extract numerical value
  pull()

# Determine critical value
t_star <- qt(0.975, df = nrow(manhattan) - 1)

# Construct interval
rent_ci_med %>%
  summarize(boot_se = sd(rent_ci_med$stat)) %>%
  summarize(l = rent_med_obs - t_star * boot_se,
            u = rent_med_obs + t_star * boot_se)
## # A tibble: 1 x 2
##       l     u
##   <dbl> <dbl>
## 1  1994  2706
data(ncbirths, package="openintro")
str(ncbirths)
## 'data.frame':    1000 obs. of  13 variables:
##  $ fage          : int  NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : int  13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weeks         : int  39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
##  $ visits        : int  10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gained        : int  38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num  7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
##  $ gender        : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
##  $ habit         : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
##  $ whitemom      : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
# Remove NA visits
ncbirths_complete_visits <- ncbirths %>%
  filter(!is.na(visits))
  
# Generate 15000 bootstrap means
visit_ci_mean <- ncbirths_complete_visits %>%
  specify(response=visits) %>%
  generate(reps=15000, type="bootstrap") %>%
  calculate(stat="mean")
  
# Calculate the 90% CI via percentile method
visit_ci_mean %>%
  summarize(l = quantile(stat, 0.05),
            u = quantile(stat, 0.95))
## # A tibble: 1 x 2
##       l     u
##   <dbl> <dbl>
## 1  11.9  12.3
# Calculate 15000 bootstrap SDs
visit_ci_sd <- ncbirths_complete_visits %>%
  specify(response=visits) %>%
  generate(reps=15000, type="bootstrap") %>%
  calculate(stat="sd")

# Calculate the 90% CI via percentile method
visit_ci_sd %>%
  summarize(l = quantile(stat, 0.05),
            u = quantile(stat, 0.95))
## # A tibble: 1 x 2
##       l     u
##   <dbl> <dbl>
## 1  3.74  4.16
# Generate 15000 bootstrap samples centered at null
rent_med_ht <- manhattan %>%
  specify(response = rent) %>%
  hypothesize(null = "point", med = 2500) %>% 
  generate(reps = 15000, type = "bootstrap") %>% 
  calculate(stat = "median")
  
# Calculate observed median
rent_med_obs <- manhattan %>%
  summarize(median(rent)) %>%
  pull()

# Calculate p-value
rent_med_ht %>%
  filter(stat > rent_med_obs) %>%
  summarize(n() / 15000)
## # A tibble: 1 x 1
##   `n()/15000`
##         <dbl>
## 1       0.948
# Generate 1500 bootstrap means centered at null
weight_mean_ht <- ncbirths %>%
  specify(response = weight) %>%
  hypothesize(null = "point", mu = 7) %>% 
  generate(reps=1500, type="bootstrap") %>% 
  calculate(stat="mean")
  
# Calculate observed mean
weight_mean_obs <- ncbirths %>%
  summarize(mean(weight)) %>%
  pull()

# Calculate p-value
weight_mean_ht %>%
  filter(stat > weight_mean_obs) %>%
  summarize((n()/1500) * 2)
## # A tibble: 1 x 1
##   `(n()/1500) * 2`
##              <dbl>
## 1           0.0253

Chapter 2 - Introducing the t-distribution

The t-distribution - especially useful when the population standard deviation is unknown (as is typically the case):

  • The t-distribution is like the normal distribution, but with thicker tails
    • Observations are more likely to be 2+ SD from the mean using the t-distribution than with the normal distribution
    • The t-distribution is always centered at zero, and has a single parameter, degrees of freedom
  • As the degrees of freedom go to infinity, the t-distribution becomes the normal distribution
    • Can always use the t-distribution, though
  • We can use the pt function to find probabilities under the t-distribution
    • For a given cutoff value q and a given degrees of freedom df, pt(q, df) gives us the probability under the t-distribution with df degrees of freedom for values of t less than q
    • In other words, P(t_df < T) = pt(q = T, df)
  • We can use the qt() function to find cutoffs under the t-distribution
    • For a given probability p and a given degrees of freedom df, qt(p, df) gives us the cutoff value for the t-distribution with df degrees of freedom for which the probability under the curve is p
    • In other words, if P(t_df < T) = p, then T = qt(p, df)
    • For example, if T corresponds to the 95th percentile of a distribution, p = 0.95

Estimating a mean with a t-interval:

  • Quantifying the expected variability of sample means - theory (CLT)
  • The Central Limit Theorem (CLT) states that the sample mean will be normal with population mean and appropriate standard error (population sigma divided by sqrt(n) where n is the sample size)
    • Since we do not have the original population, we never really have the population sigma
    • However, the standard error is frequently estimated as the sample standard deviation divided by the square root of the sample size
    • We use a t-distribution with df=n-1 to account for the extra uncertainty
  • The CLT has some key assumptions that must be validated first
    • Independence of observations - hard to check, but typically assumed when the sampling methodology is appropriate
    • The more skewed the original population, the larger sample size that is needed
  • The function t.test(myVar, conf.level=) will generate a confidence interval for the mean of myVar, as well as a p-value for the mean being non-zero

The t-interval for paired data:

  • An example would be the same student taking two tests - this means the data are NOT independent, but instead they are paired
    • Can be helpful in these cases to create a variable diff which is the difference in test scores by student
    • Can then just run the normal t-test on the differences

Testing a mean with a t-test:

  • Can run t.test(myVar, mu=myNullValue, alternative="two.sided") to run a two-sided t-test for mean(myVar) != myNullValue
    • Will provide a p-value as well as a 95% CI for the mean of myVar

Example code includes:

# P(T < 3) for df = 10
(x <- pt(3, df = 10))
## [1] 0.9933282
# P(T > 3) for df = 10
(y <- 1 - pt(3, df=10))
## [1] 0.006671828
# P(T > 3) for df = 100
(z <- 1 - pt(3, df=100))
## [1] 0.001703958
# Comparison
y == z
## [1] FALSE
y > z
## [1] TRUE
y < z
## [1] FALSE
# 95th percentile for df = 10
(x <- qt(0.95, df = 10))
## [1] 1.812461
# upper bound of middle 95th percent for df = 10
(y <- qt(0.975, df = 10))
## [1] 2.228139
# upper bound of middle 95th percent for df = 100
(z <- qt(0.975, df = 100))
## [1] 1.983972
# Comparison
y == z
## [1] FALSE
y > z
## [1] TRUE
y < z
## [1] FALSE
data(acs12, package="openintro")

# Subset for employed respondents
acs12_emp <- acs12 %>%
  filter(employment == "employed")

# Construct 95% CI for avg time_to_work
t.test(acs12_emp$time_to_work, conf.level=0.95)
## 
##  One Sample t-test
## 
## data:  acs12_emp$time_to_work
## t = 32.635, df = 782, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  24.43369 27.56120
## sample estimates:
## mean of x 
##  25.99745
t.test(acs12_emp$hrs_work, conf.level=0.95)
## 
##  One Sample t-test
## 
## data:  acs12_emp$hrs_work
## t = 87.521, df = 842, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  38.05811 39.80429
## sample estimates:
## mean of x 
##   38.9312
data(textbooks, package="openintro")

# 90% CI
t.test(textbooks$diff, conf.level = 0.9)
## 
##  One Sample t-test
## 
## data:  textbooks$diff
## t = 7.6488, df = 72, p-value = 6.928e-11
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##   9.981505 15.541783
## sample estimates:
## mean of x 
##  12.76164
# 95% CI
t.test(textbooks$diff, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  textbooks$diff
## t = 7.6488, df = 72, p-value = 6.928e-11
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   9.435636 16.087652
## sample estimates:
## mean of x 
##  12.76164
# 99% CI
t.test(textbooks$diff, conf.level = 0.99)
## 
##  One Sample t-test
## 
## data:  textbooks$diff
## t = 7.6488, df = 72, p-value = 6.928e-11
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
##   8.347154 17.176133
## sample estimates:
## mean of x 
##  12.76164
# Conduct HT
t.test(textbooks$diff, mu=0, alternative="two.sided", conf.level=0.95)
## 
##  One Sample t-test
## 
## data:  textbooks$diff
## t = 7.6488, df = 72, p-value = 6.928e-11
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   9.435636 16.087652
## sample estimates:
## mean of x 
##  12.76164
# Calculate 15000 bootstrap means
textdiff_med_ci <- textbooks %>%
  specify(response = diff) %>%
  generate(reps=15000, type="bootstrap") %>%
  calculate(stat = "median")
  
# Calculate the 95% CI via percentile method
textdiff_med_ci %>%
  summarize(l=quantile(stat, 0.025), 
            u=quantile(stat, 0.975))
## # A tibble: 1 x 2
##       l     u
##   <dbl> <dbl>
## 1  5.04  11.7
data(hsb2, package="openintro")

# Calculate diff
hsb2 <- hsb2 %>%
  mutate(diff = math - science)
  
# Generate 15000 bootstrap means centered at null
scorediff_med_ht <- hsb2 %>%
  specify(response=diff) %>%
  hypothesize(null="point", mu=0) %>% 
  generate(reps=15000, type="bootstrap") %>% 
  calculate(stat="median")
  
# Calculate observed median of differences
scorediff_med_obs <- hsb2 %>%
  summarize(median(diff)) %>%
  pull()

# Calculate p-value
scorediff_med_ht %>%
  filter(stat > scorediff_med_obs) %>%
  summarize(p_val = (n() / 15000) * 2)
## # A tibble: 1 x 1
##   p_val
##   <dbl>
## 1 0.529

Chapter 3 - Inference for Difference in Two Parameters

Hypothesis testing for comparing two means:

  • The stem.cell data set is available in the openintro package - question of whether stem cells help with heart recovery in sheep
    • Question is the impact of test vs. control, with each sheep having change measured, but only some having the stem cell therapy
  • For the hacker statistics approach, can randomly assign the sheep (multiple times) as test vs. control, and plot the ECDF (or similar) of the changes
    • Can then compare how extreme our actual sample is relative to the hacker statistics simulation
  • The library(infer) package is built to help with problems like this; the skeleton of the workflow is
    • diff_ht_mean <- stem.cell %>%
    • specify(__) %>% # y ~ x
    • hypothesize(null = __) %>% # "independence" or "point"
    • generate(reps = __, type = __) %>% # "bootstrap", "permute", or "simulate"
    • calculate(stat = "diff in means")

  • For problems like this, the null hypothesis is “independence” and the generation type is “permute”

Bootstrap CI for difference in two means:

  • Take a bootstrap sample from each of the two groups
  • Calculate the bootstrap statistic of interest
  • Repeat as needed to calculate a bootstrap interval
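The three steps above can be sketched in base R without library(infer). This is a minimal sketch; the vectors a and b are made-up illustrative data, not from the course:

```r
# Hedged sketch: manual bootstrap CI for a difference in two means.
# a and b are hypothetical group samples, invented for illustration.
set.seed(42)
a <- c(5.1, 4.8, 6.2, 5.5, 5.9, 4.7, 5.3)
b <- c(4.2, 4.9, 3.8, 4.5, 4.1, 4.6, 3.9)

# Step 1 & 2, repeated: resample within each group, compute the statistic
boot_diffs <- replicate(2000, {
  mean(sample(a, replace = TRUE)) - mean(sample(b, replace = TRUE))
})

# Step 3: 95% CI via the percentile method
quantile(boot_diffs, c(0.025, 0.975))
```

The infer equivalent replaces the replicate() call with specify() %>% generate(type = "bootstrap") %>% calculate(stat = "diff in means"), as in the example code further below.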

Comparing means with a t-test:

  • Looking at the average hourly rate vs. citizenship from the ACS data
    • t.test(hrly_rate ~ citizen, data=acs12, null=0, alternative="two.sided")
  • Review of conditions required for the t-test
    • Independence of observations (usually assumed with proper randomization and a sample size that is small relative to the population)
    • Independence of observations across the samples (not paired)
    • Skewed samples require larger sample sizes for the normality approximations to be valid

Example code includes:

data(stem.cell, package="openintro")
str(stem.cell)
## 'data.frame':    18 obs. of  3 variables:
##  $ trmt  : Factor w/ 2 levels "ctrl","esc": 1 1 1 1 1 1 1 1 1 2 ...
##  $ before: num  35.2 36.5 39.8 39.8 41.8 ...
##  $ after : num  29.5 29.5 36.2 38 37.5 ...
# Calculate difference between before and after
stem.cell <- stem.cell %>%
  mutate(change = after - before)

# Calculate observed difference in means
diff_mean <- stem.cell %>%
  # Group by treatment group
  group_by(trmt) %>%       
  # Calculate mean change for each group
  summarize(mean_change = mean(change)) %>%
  # Extract
  pull() %>% 
  # Calculate difference
  diff()                      


# Generate 1000 differences in means via randomization
diff_ht_mean <- stem.cell %>%
  # y ~ x
  specify(change ~ trmt) %>% 
  # Null = no difference between means
  hypothesize(null = "independence") %>% 
  # Shuffle labels 1000 times
  generate(reps = 1000, type = "permute") %>% 
  # Calculate test statistic
  calculate(stat = "diff in means", order=rev(levels(stem.cell$trmt)))

# Calculate p-value
diff_ht_mean %>%
  # Identify simulated test statistics at least as extreme as observed
  filter(stat > diff_mean) %>%
  # Calculate p-value
  summarize(p_val = (n() / 1000))
## # A tibble: 1 x 1
##   p_val
##   <dbl>
## 1     0
# Remove subjects with missing habit
ncbirths_complete_habit <- ncbirths %>%
  filter(!is.na(habit))

# Calculate observed difference in means
diff_mean <- ncbirths_complete_habit %>%
  # Group by habit group
  group_by(habit) %>%
  # Calculate mean weight for each group
  summarize(mean_weight = mean(weight)) %>%
  # Extract
  pull() %>%
  # Calculate difference
  diff()                             
  
# Generate 1000 differences in means via randomization
diff_ht_mean <- ncbirths_complete_habit %>%
  # y ~ x
  specify(weight ~ habit) %>%
  # Null = no difference between means
  hypothesize(null = "independence") %>%  
  # Shuffle labels 1000 times
  generate(reps = 1000, type = "permute") %>%
  # Calculate test statistic
  calculate(stat = "diff in means", order=rev(levels(ncbirths_complete_habit$habit)))

# Calculate p-value
diff_ht_mean %>%
  # Identify simulated test statistics at least as extreme as observed
  filter(stat < diff_mean) %>%
  # Calculate p-value
  summarize(p_val = (n()/1000) * 2)
## # A tibble: 1 x 1
##    p_val
##    <dbl>
## 1 0.0280
# Generate 1500 bootstrap difference in means
diff_mean_ci <- ncbirths_complete_habit %>%
  specify(weight ~ habit) %>%
  generate(reps = 1500, type = "bootstrap") %>%
  calculate(stat = "diff in means", order=rev(levels(ncbirths_complete_habit$habit)))

# Calculate the 95% CI via percentile method
diff_mean_ci %>%
  summarize(l=quantile(stat, 0.025), 
            u=quantile(stat, 0.975))
## # A tibble: 1 x 2
##        l       u
##    <dbl>   <dbl>
## 1 -0.583 -0.0530
# Remove subjects with missing habit and weeks
ncbirths_complete_habit_weeks <- ncbirths %>%
  filter(!is.na(habit) & !is.na(weeks))

# Generate 1500 bootstrap difference in medians
diff_med_ci <- ncbirths_complete_habit_weeks %>%
  specify(weeks ~ habit) %>%
  generate(reps = 1500, type = "bootstrap") %>%
  calculate(stat="diff in medians", order=rev(levels(ncbirths_complete_habit_weeks$habit)))

# Calculate the 92% CI via percentile method
diff_med_ci %>%
  summarize(l=quantile(stat, 0.04), 
            u=quantile(stat, 0.96))
## # A tibble: 1 x 2
##       l     u
##   <dbl> <dbl>
## 1 -1.00     0
# Create hrly_pay and filter for non-missing hrly_pay and citizen
acs12_complete_hrlypay_citizen <- acs12 %>%
  mutate(hrly_pay = income / (hrs_work * 52)) %>%
  filter(
    !is.na(hrly_pay),
    !is.na(citizen)
  )

# Calculate percent missing
new_n <- nrow(acs12_complete_hrlypay_citizen)
old_n <- nrow(acs12)
(perc_missing <- (old_n - new_n) / old_n) 
## [1] 0.5205
# Calculate summary statistics
acs12_complete_hrlypay_citizen %>%
  group_by(citizen) %>%
  summarize(
    x_bar = mean(hrly_pay),
    s = sd(hrly_pay),
    n = n()
  )
## # A tibble: 2 x 4
##   citizen x_bar     s     n
##   <fct>   <dbl> <dbl> <int>
## 1 no       21.2  34.5    58
## 2 yes      18.5  24.7   901
# Plot the distributions
ggplot(data = acs12_complete_hrlypay_citizen, mapping = aes(x = hrly_pay)) +
  geom_histogram(binwidth = 5) +
  facet_grid(. ~ citizen, labeller = labeller(citizen = c(no  = "Non citizen", 
                                                          yes = "Citizen"))) 

# Construct 95% CI
t.test(hrly_pay ~ citizen, data=acs12_complete_hrlypay_citizen, null=0, alternative="two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  hrly_pay by citizen
## t = 0.58058, df = 60.827, p-value = 0.5637
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.53483 11.88170
## sample estimates:
##  mean in group no mean in group yes 
##          21.19494          18.52151

Chapter 4 - Comparing Many Means

Vocabulary scores vary between social classes:

  • Data set includes wordsum (vocabulary score) and class (lower, working, middle, upper)

ANOVA - Analysis of Variance:

  • Example of runners in a marathon finishing in different times based on many different factors
  • Suppose that we are interested in a specific variable X (perhaps training time)
    • Variability in finishing time due to X
    • Variability in finishing time due to all factors other than X
  • The null hypothesis is that the means are the same across all of the groups, while the alternate hypothesis is that at least one mean is different
  • Can assess the total variability of vocabulary scores as follows
    • Variability between groups
    • Variability within groups
  • Running aov(y ~ x, data=z) (response on the left, grouping variable on the right) will run ANOVA and report on
    • myVar - between groups df, sumsq, and the like
    • Residuals - within groups df, sumsq, and the like
    • Can also calculate the percentage of variability explained
    • The F-statistic is the key test statistic for this type of analysis

Conditions for ANOVA:

  • Independence - within groups (sampled observations must be independent) and across groups (must be non-paired)
    • Generally assumed to be OK with a properly stratified and randomized sample that is reasonably small relative to the population
    • The between groups pairing can be handled with techniques not covered during this course
  • Approximate normality within each group
  • Equal variance within each group
    • Especially important when sample sizes are significantly different across groups

Post-hoc testing - determining which of the means are different:

  • Can run t-tests for each group comparison, though this will inflate the Type I error rate
    • Can instead use a modified significance level for each individual test to maintain the desired overall Type I error rate
    • The Bonferroni correction is common - newAlpha = tgtAlpha / K where K = k * (k-1) / 2 with k being the number of groups
  • Since there has been an assumption of constant variance, can use a consistent standard error and degrees of freedom for all the tests
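As a concrete check of the Bonferroni arithmetic for the four-class wordsum data:

```r
# Bonferroni-adjusted significance level for all pairwise comparisons
k <- 4                      # number of groups
K <- k * (k - 1) / 2        # number of pairwise tests: 6
tgtAlpha <- 0.05
(newAlpha <- tgtAlpha / K)
## [1] 0.008333333
```

Each pairwise p-value from pairwise.t.test(..., p.adjust = "none") is then compared against newAlpha rather than 0.05.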

Wrap-up:

  • Simulation-based and CLT-based inference
  • Single variables and bivariate variables
  • Two levels and multiple levels

Example code includes:

gss <- readr::read_csv("./RInputFiles/gss_wordsum_class.csv")
## Parsed with column specification:
## cols(
##   wordsum = col_integer(),
##   class = col_character()
## )
str(gss)
## Classes 'tbl_df', 'tbl' and 'data.frame':    795 obs. of  2 variables:
##  $ wordsum: int  6 9 6 5 6 6 8 10 8 9 ...
##  $ class  : chr  "MIDDLE" "WORKING" "WORKING" "WORKING" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 2
##   .. ..$ wordsum: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ class  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
ggplot(gss, aes(x=wordsum)) +
  geom_histogram(binwidth=1) +
  facet_grid(class ~ .)

aov_wordsum_class <- aov(wordsum ~ class, data=gss)
broom::tidy(aov_wordsum_class)
##        term  df     sumsq    meansq statistic      p.value
## 1     class   3  236.5644 78.854810  21.73467 1.560565e-13
## 2 Residuals 791 2869.8003  3.628066        NA           NA
gss %>%
  group_by(class) %>%
  summarize(s = sd(wordsum))
## # A tibble: 4 x 2
##   class       s
##   <chr>   <dbl>
## 1 LOWER    2.24
## 2 MIDDLE   1.89
## 3 UPPER    2.34
## 4 WORKING  1.87
# Conduct the pairwise.t.test with the p.adjust = "none" option (we'll adjust the
# significance level, not the p-values). The first argument is the response
# vector and the second argument is the grouping vector.
pairwise.t.test(gss$wordsum, gss$class, p.adjust = "none") %>%
  broom::tidy()
##    group1 group2      p.value
## 1  MIDDLE  LOWER 1.133345e-07
## 2   UPPER  LOWER 4.752521e-02
## 3 WORKING  LOWER 3.055619e-02
## 5   UPPER MIDDLE 2.395734e-01
## 6 WORKING MIDDLE 1.631637e-12
## 9 WORKING  UPPER 3.670775e-01

Introduction to Statistics with R: Correlation and Linear Regression

Chapter 1 - Introduction to Correlation Coefficients

How are correlation coefficients calculated?

  • Can be calculated using the raw-score formula or the Z-score formula
  • The general formula for calculating the correlation coefficient between two variables is
    • r=cov(A,B) / [sA * sB]
    • where cov(A,B) is the covariance between A and B, while sA and sB are the standard deviations
  • The covariance is defined as follows
    • diff_A = A - mean(A)
    • diff_B = B - mean(B)
    • cov(A, B) = sum(diff_A * diff_B) / (length(A) - 1) # A and B need to be of the same length, so length(A) or length(B) will do
  • The standard deviation is defined as the sample standard deviation, so (length(A) - 1) is in the denominator prior to the square root being taken
    • sd_A = sqrt( sum(diff_A ** 2) / (length(A) - 1) )

Usefulness of correlation coefficients:

  • Correlation ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship
  • When variables are strongly correlated, knowing one variable can help you predict another variable
    • Working memory capacity is strongly correlated with intelligence and IQ

Points of caution:

  • Correlation does not imply causation
  • The magnitude of correlation depends on many factors - sampling (full random vs. targeted population), measurement (reliable and valid), etc.
    • Attenuation of correlation due to restriction of range - correlation on college graduates only may not work well
  • Correlation coefficient is a sample statistic, just like the mean

Example code includes:

PE <- read.table("http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt", header=TRUE)
# Take a quick peek at both vectors
(A <- c(1, 2, 3))
## [1] 1 2 3
(B <- c(3, 6, 7))
## [1] 3 6 7
# Save the differences of each vector element with the mean in a new variable
diff_A <- A - mean(A)
diff_B <- B - mean(B)

# Do the summation of the elements of the vectors and divide by N-1 in order to acquire the covariance between the two vectors
cov <- sum(diff_A*diff_B)/ (length(A)-1)


# Square the differences that were found in the previous step
sq_diff_A <- diff_A ** 2
sq_diff_B <- diff_B ** 2

# Take the sum of the elements, divide them by N-1 and consequently take the square root to acquire the sample standard deviations
sd_A <- sqrt(sum(sq_diff_A)/(length(A)-1))
sd_B <- sqrt(sum(sq_diff_B)/(length(B)-1))


# Combine all the pieces of the puzzle
correlation <- cov / (sd_A * sd_B)
correlation
## [1] 0.9607689
# Check the validity of your result with the cor() command
cor(A, B)
## [1] 0.9607689
# Read data from a URL into a dataframe called PE (physical endurance) - moved above to cache
# PE <- read.table("http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt", header=TRUE)

# Summary statistics
psych::describe(PE)
##             vars   n   mean    sd median trimmed   mad min max range skew
## pid            1 200 101.81 58.85  101.5  101.71 74.87   1 204   203 0.01
## age            2 200  49.41 10.48   48.0   49.46 10.38  20  82    62 0.06
## activeyears    3 200  10.68  4.69   11.0   10.57  4.45   0  26    26 0.30
## endurance      4 200  26.50 10.84   27.0   26.22 10.38   3  55    52 0.22
##             kurtosis   se
## pid            -1.21 4.16
## age            -0.14 0.74
## activeyears     0.46 0.33
## endurance      -0.44 0.77
# Scatter plots
plot(PE$age ~ PE$activeyears)

plot(PE$endurance ~ PE$activeyears)

plot(PE$endurance ~ PE$age)

# Correlation Analysis
round(cor(PE[, !(names(PE) == "pid")]), 2)
##               age activeyears endurance
## age          1.00        0.33     -0.08
## activeyears  0.33        1.00      0.33
## endurance   -0.08        0.33      1.00
# Run correlation tests. If the null hypothesis of no correlation can be rejected
# at the 5% significance level, then the relationship between the variables is
# significantly different from zero at the 95% confidence level
cor.test(PE$age, PE$activeyears)
## 
##  Pearson's product-moment correlation
## 
## data:  PE$age and PE$activeyears
## t = 4.9022, df = 198, p-value = 1.969e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1993491 0.4473145
## sample estimates:
##       cor 
## 0.3289909
cor.test(PE$endurance, PE$activeyears)
## 
##  Pearson's product-moment correlation
## 
## data:  PE$endurance and PE$activeyears
## t = 4.8613, df = 198, p-value = 2.37e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1967110 0.4451154
## sample estimates:
##       cor 
## 0.3265402
cor.test(PE$endurance, PE$age)
## 
##  Pearson's product-moment correlation
## 
## data:  PE$endurance and PE$age
## t = -1.1981, df = 198, p-value = 0.2323
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.22097811  0.05454491
## sample estimates:
##         cor 
## -0.08483813
# The impact dataset is already loaded in
rawImpactData <- " 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, control, control, control, control, control, control, control, control, control, control, control, control, control, control, control, control, control, control, control, control, concussed, concussed, concussed, concussed, concussed, concussed, concussed, concussed, concussed, concussed, concussed, concussed, concussed, concussed, concussed, concussed, concussed, concussed, concussed, concussed, 95, 90, 87, 84, 92, 89, 78, 97, 93, 90, 89, 97, 79, 86, 85, 85, 98, 95, 96, 92, 79, 85, 97, 89, 75, 75, 84, 93, 88, 97, 93, 96, 84, 89, 95, 95, 97, 95, 92, 95, 88, 82, 77, 72, 77, 79, 63, 82, 85, 66, 76, 79, 60, 59, 60, 76, 85, 83, 67, 84, 81, 85, 91, 74, 63, 68, 78, 74, 80, 73, 74, 70, 81, 72, 90, 74, 70, 63, 65, 69, 35.29, 31.47, 30.87, 41.87, 33.28, 40.73, 38.09, 31.65, 39.59, 30.53, 33.65, 37.51, 40.39, 32.88, 33.39, 35.13, 38.51, 29.64, 35.32, 27.36, 27.19, 32.66, 26.29, 28.92, 32.77, 32.92, 34.26, 36.08, 31.63, 28.89, 35.81, 33.61, 34.46, 39.18, 33.14, 33.03, 39.01, 35.06, 30.58, 38.45, 0.42, 0.63, 0.56, 0.66, 0.56, 0.81, 0.66, 0.79, 0.68, 0.60, 0.74, 0.51, 0.82, 0.59, 0.82, 0.63, 0.73, 0.57, 0.65, 1.00, 0.57, 0.71, 0.82, 0.61, 0.72, 0.50, 0.54, 0.65, 0.66, 0.71, 0.55, 0.79, 0.48, 0.55, 1.20, 0.73, 0.60, 0.84, 0.60, 0.42, 11,  7,  8,  7,  7,  6,  6, 10,  7, 10,  7,  7, 12,  2,  9, 10, 10,  8,  5, 11,  7,  9,  9,  9,  8,  9,  6, 10,  9,  7,  9,  7,  7, 10, 10, 11, 10,  5,  8, 11, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 97, 86, 90, 85, 87, 91, 90, 94, 91, 93, 92, 89, 84, 81, 85, 87, 96, 93, 95, 93, 63, 79, 91, 85, 74, 72, 80, 59, 75, 90, 66, 85, 72, 82, 80, 59, 74, 62, 67, 66, 86, 80, 79, 70, 77, 85, 60, 72, 83, 68, 72, 79, 67, 71, 61, 72, 78, 85, 67, 80, 75, 79, 80, 72, 56, 66, 74, 69, 79, 73, 69, 61, 79, 66, 80, 
70, 62, 54, 57, 63, 35.61, 37.01, 20.15, 33.26, 28.34, 33.47, 44.28, 36.14, 37.42, 25.19, 23.63, 26.32, 43.70, 32.40, 39.32, 35.62, 39.95, 35.62, 30.21, 30.37, 29.23, 44.45, 26.12, 27.98, 60.77, 31.91, 49.62, 35.68, 55.67, 25.70, 35.21, 33.01, 37.46, 53.20, 33.20, 34.59, 39.66, 35.09, 32.30, 44.49, 0.65, 0.49, 0.75, 0.19, 0.59, 0.48, 0.77, 0.90, 0.65, 0.59, 0.55, 0.56, 0.57, 0.69, 0.73, 0.48, 0.43, 0.37, 0.47, 0.50, 0.61, 0.65, 1.12, 0.65, 0.71, 0.79, 0.64, 0.70, 0.68, 0.73, 0.58, 0.97, 0.56, 0.51, 1.30, 0.70, 0.74, 1.24, 0.65, 0.98, 10,  7,  9,  8,  8,  5,  6, 10,  8, 11,  9,  9, 10,  3, 10, 12, 10,  9,  5, 11,  3,  6,  5,  5,  1,  9,  7, 11,  6,  3,  4,  3,  1,  7,  7,  4,  5,  2,  6,  5,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0, 26, 34, 27, 22, 26, 35, 43, 31, 39, 25, 31, 38, 14, 16, 33, 13, 27, 15, 19, 39"
rawImpactNames <- c('subject', 'condition', 'vermem1', 'vismem1', 'vms1', 'rt1', 'ic1', 'sym1', 'vermem2', 'vismem2', 'vms2', 'rt2', 'ic2', 'sym2')
splitImpactData <- stringr::str_split(rawImpactData, ",")
impactRawMatrix <- matrix(data=splitImpactData[[1]], ncol=length(rawImpactNames))
colnames(impactRawMatrix) <- rawImpactNames

rawImpactDF <- as.data.frame(impactRawMatrix, stringsAsFactors=FALSE)
for (intCtr in c(1, 3:ncol(rawImpactDF))) { rawImpactDF[, intCtr] <- as.numeric(rawImpactDF[, intCtr]) }
rawImpactDF$condition <- factor(stringr::str_replace_all(rawImpactDF$condition, " ", ""))
impact <- rawImpactDF


# Summary statistics entire dataset
psych::describe(impact)
##            vars  n  mean    sd median trimmed   mad   min   max range
## subject       1 40 20.50 11.69  20.50   20.50 14.83  1.00 40.00 39.00
## condition*    2 40  1.50  0.51   1.50    1.50  0.74  1.00  2.00  1.00
## vermem1       3 40 89.75  6.44  91.00   90.44  6.67 75.00 98.00 23.00
## vismem1       4 40 74.88  8.60  75.00   74.97  9.64 59.00 91.00 32.00
## vms1          5 40 34.03  3.90  33.50   34.02  3.62 26.29 41.87 15.58
## rt1           6 40  0.67  0.15   0.65    0.66  0.13  0.42  1.20  0.78
## ic1           7 40  8.28  2.05   8.50    8.38  2.22  2.00 12.00 10.00
## sym1          8 40  0.05  0.22   0.00    0.00  0.00  0.00  1.00  1.00
## vermem2       9 40 82.00 11.02  85.00   82.97  9.64 59.00 97.00 38.00
## vismem2      10 40 71.90  8.42  72.00   72.19 10.38 54.00 86.00 32.00
## vms2         11 40 35.83  8.66  35.15   34.98  6.89 20.15 60.77 40.62
## rt2          12 40  0.67  0.22   0.65    0.65  0.13  0.19  1.30  1.11
## ic2          13 40  6.75  2.98   7.00    6.81  2.97  1.00 12.00 11.00
## sym2         14 40 13.88 15.32   7.00   12.38 10.38  0.00 43.00 43.00
##             skew kurtosis   se
## subject     0.00    -1.29 1.85
## condition*  0.00    -2.05 0.08
## vermem1    -0.70    -0.51 1.02
## vismem1    -0.11    -0.96 1.36
## vms1        0.08    -0.75 0.62
## rt1         1.14     2.21 0.02
## ic1        -0.57     0.36 0.32
## sym1        3.98    14.16 0.03
## vermem2    -0.65    -0.81 1.74
## vismem2    -0.28    -0.87 1.33
## vms2        0.86     0.65 1.37
## rt2         0.93     1.29 0.03
## ic2        -0.16    -1.06 0.47
## sym2        0.44    -1.47 2.42
# Calculate correlation coefficient
entirecorr <- round(cor(impact$vismem2, impact$vermem2), 2)

# Summary statistics subsets
psych::describeBy(impact, impact$condition)
## 
##  Descriptive statistics by group 
## group: concussed
##            vars  n  mean    sd median trimmed   mad   min   max range
## subject       1 20 30.50  5.92  30.50   30.50  7.41 21.00 40.00 19.00
## condition*    2 20  1.00  0.00   1.00    1.00  0.00  1.00  1.00  0.00
## vermem1       3 20 89.65  7.17  92.50   90.56  5.93 75.00 97.00 22.00
## vismem1       4 20 74.75  8.03  74.00   74.25  8.15 63.00 91.00 28.00
## vms1          5 20 33.20  3.62  33.09   33.27  3.32 26.29 39.18 12.89
## rt1           6 20  0.66  0.17   0.63    0.64  0.13  0.42  1.20  0.78
## ic1           7 20  8.55  1.64   9.00    8.62  1.48  5.00 11.00  6.00
## sym1          8 20  0.05  0.22   0.00    0.00  0.00  0.00  1.00  1.00
## vermem2       9 20 74.05  9.86  74.00   73.88 11.86 59.00 91.00 32.00
## vismem2      10 20 69.20  8.38  69.50   69.62 10.38 54.00 80.00 26.00
## vms2         11 20 38.27 10.01  35.15   37.32  7.73 25.70 60.77 35.07
## rt2          12 20  0.78  0.23   0.70    0.74  0.11  0.51  1.30  0.79
## ic2          13 20  5.00  2.53   5.00    4.88  2.97  1.00 11.00 10.00
## sym2         14 20 27.65  9.07  27.00   27.75 11.12 13.00 43.00 30.00
##             skew kurtosis   se
## subject     0.00    -1.38 1.32
## condition*   NaN      NaN 0.00
## vermem1    -0.79    -0.70 1.60
## vismem1     0.45    -0.72 1.80
## vms1       -0.13    -0.78 0.81
## rt1         1.38     2.41 0.04
## ic1        -0.39    -0.81 0.37
## sym1        3.82    13.29 0.05
## vermem2     0.07    -1.24 2.21
## vismem2    -0.27    -1.26 1.87
## vms2        0.77    -0.57 2.24
## rt2         1.09    -0.10 0.05
## ic2         0.39    -0.28 0.57
## sym2       -0.11    -1.25 2.03
## -------------------------------------------------------- 
## group: control
##            vars  n  mean   sd median trimmed  mad   min   max range  skew
## subject       1 20 10.50 5.92  10.50   10.50 7.41  1.00 20.00 19.00  0.00
## condition*    2 20  2.00 0.00   2.00    2.00 0.00  2.00  2.00  0.00   NaN
## vermem1       3 20 89.85 5.82  90.00   90.31 7.41 78.00 98.00 20.00 -0.41
## vismem1       4 20 75.00 9.34  77.00   75.50 9.64 59.00 88.00 29.00 -0.46
## vms1          5 20 34.86 4.09  34.39   34.85 4.92 27.36 41.87 14.51  0.09
## rt1           6 20  0.67 0.13   0.66    0.67 0.13  0.42  1.00  0.58  0.47
## ic1           7 20  8.00 2.41   7.50    8.12 2.22  2.00 12.00 10.00 -0.41
## sym1          8 20  0.05 0.22   0.00    0.00 0.00  0.00  1.00  1.00  3.82
## vermem2       9 20 89.95 4.36  90.50   90.06 5.19 81.00 97.00 16.00 -0.25
## vismem2      10 20 74.60 7.76  74.50   75.00 8.15 60.00 86.00 26.00 -0.23
## vms2         11 20 33.40 6.44  34.54   33.52 6.30 20.15 44.28 24.13 -0.25
## rt2          12 20  0.57 0.16   0.56    0.57 0.13  0.19  0.90  0.71 -0.16
## ic2          13 20  8.50 2.31   9.00    8.69 1.48  3.00 12.00  9.00 -0.73
## sym2         14 20  0.10 0.31   0.00    0.00 0.00  0.00  1.00  1.00  2.47
##            kurtosis   se
## subject       -1.38 1.32
## condition*      NaN 0.00
## vermem1       -0.87 1.30
## vismem1       -1.27 2.09
## vms1          -1.19 0.91
## rt1           -0.02 0.03
## ic1           -0.17 0.54
## sym1          13.29 0.05
## vermem2       -1.02 0.97
## vismem2       -1.11 1.73
## vms2          -0.77 1.44
## rt2            0.06 0.04
## ic2           -0.32 0.52
## sym2           4.32 0.07
# Create 2 subsets: control and concussed
control <- subset(impact, condition == "control")
concussed <- subset(impact, condition == "concussed")

# Calculate correlation coefficients for each subset
controlcorr <- round(cor(control$vismem2, control$vermem2), 2)
concussedcorr <- round(cor(concussed$vismem2, concussed$vermem2), 2)

# Display all values at the same time
correlations <- cbind(entirecorr, controlcorr, concussedcorr)
correlations
##      entirecorr controlcorr concussedcorr
## [1,]       0.45        0.37          0.35

Chapter 2 - Introduction to Linear Regression

Introduction to regression:

  • Linear regression uses correlations to make predictions about one variable based on one or more other variables
  • Simple regression has only a single predictor variable while multiple regression has many predictor variables

Regression equations and the R-squared value:

  • Simple regression (one predictor) vs. multiple regression (2+ predictors)
    • y = b (intercept) + m (slope) * x + e (error or residual)
    • y = B0 + B1 * X1 + e (more generalized equation - allows for Bi * Xi for as many i as needed)
  • R is the multiple correlation coefficient, which is the correlation between the predicted outcomes and the actual outcomes
    • R^2 is the percentage of variance explained by the model
  • The regression equation would then be
    • sym2 = B0 + B1 * ic2 + e,
    • where B0 is the intercept, B1 the slope and e the residual error
  • There are five values required for the calculation of a regression line for this model:
    • The mean of sym2: mean_sym2,
    • The mean of ic2: mean_ic2,
    • The standard deviation of sym2: sd_sym2,
    • The standard deviation of ic2: sd_ic2, and
    • The correlation coefficient between sym2 and ic2: r
  • The general formula for the slope is:
    • B1 = r * sd(y) / sd(x)
  • The general formula for the intercept is
    • B0 = mean(y) - B1 * mean(x)

Multiple linear regression:

  • Adding predictor variables (or better predictor variables) can increase the predictive power of the regression

Example code includes:

# Look at the dataset. Note that the variables we are interested in are on the 9th to 14th columns
head(impact)
##   subject condition vermem1 vismem1  vms1  rt1 ic1 sym1 vermem2 vismem2
## 1       1   control      95      88 35.29 0.42  11    0      97      86
## 2       2   control      90      82 31.47 0.63   7    0      86      80
## 3       3   control      87      77 30.87 0.56   8    0      90      79
## 4       4   control      84      72 41.87 0.66   7    0      85      70
## 5       5   control      92      77 33.28 0.56   7    1      87      77
## 6       6   control      89      79 40.73 0.81   6    0      91      85
##    vms2  rt2 ic2 sym2
## 1 35.61 0.65  10    0
## 2 37.01 0.49   7    0
## 3 20.15 0.75   9    0
## 4 33.26 0.19   8    0
## 5 28.34 0.59   8    1
## 6 33.47 0.48   5    0
# Create a correlation matrix for the dataset
correlations <- cor(impact[, 9:14])

# Create the scatterplot matrix for the dataset
corrplot::corrplot(correlations)

# Calculate the required means, standard deviations and correlation coefficient
mean_sym2 <- mean(impact$sym2)
mean_ic2 <- mean(impact$ic2)
sd_sym2 <- sd(impact$sym2)
sd_ic2 <- sd(impact$ic2)
r <- cor(impact$ic2,impact$sym2)

# Calculate the slope
B_1 <- r * ( sd_sym2 )/( sd_ic2 )

# Calculate the intercept
B_0 <- mean_sym2 - B_1 * mean_ic2

# Plot of ic2 against sym2
plot(x=impact$ic2, y=impact$sym2, main = "Scatterplot", ylab = "Symptoms", xlab = "Impulse Control")

# Add the regression line
abline(B_0, B_1, col = "red")

# Construct the regression model
model_1 <- lm(impact$sym2 ~ impact$ic2)

# Look at the results of the regression by using the summary function
summary(model_1)
## 
## Call:
## lm(formula = impact$sym2 ~ impact$ic2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.441  -8.983  -5.309   9.127  29.696 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.2945     5.5090   5.318  4.9e-06 ***
## impact$ic2   -2.2844     0.7483  -3.053  0.00413 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.91 on 38 degrees of freedom
## Multiple R-squared:  0.1969, Adjusted R-squared:  0.1758 
## F-statistic: 9.319 on 1 and 38 DF,  p-value: 0.004125
# Create a scatter plot of Impulse Control against Symptom Score
plot(impact$sym2 ~ impact$ic2, main = "Scatterplot", ylab = "Symptoms", xlab = "Impulse Control")

# Add a regression line
abline(model_1, col = "red")

# Multiple Regression
model_2 <- lm(impact$sym2 ~ impact$ic2 + impact$vermem2)

# Examine the results of the regression
summary(model_2)
## 
## Call:
## lm(formula = impact$sym2 ~ impact$ic2 + impact$vermem2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.274  -8.031  -2.703   6.245  27.962 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     79.7639    14.7765   5.398  4.1e-06 ***
## impact$ic2      -1.0711     0.7335  -1.460 0.152690    
## impact$vermem2  -0.7154     0.1981  -3.611 0.000898 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.12 on 37 degrees of freedom
## Multiple R-squared:  0.4062, Adjusted R-squared:  0.3742 
## F-statistic: 12.66 on 2 and 37 DF,  p-value: 6.482e-05
# Extract the predicted values
predicted <- fitted(model_2)

# Plotting predicted scores against observed scores
plot(predicted ~ impact$sym2, main = "Scatterplot", xlab = "Observed Scores", ylab = "Predicted Scores")
abline(lm(predicted ~ impact$sym2), col = "green")


Chapter 3 - Linear Regression Models (cont)

Estimation of coefficients - the key concept is to minimize the residuals (specifically, the squared residuals):

  • Ordinary Least Squares (OLS) is the process of minimizing the sum of the squared residuals
  • There is a total sum of squares for the dependent variable, and a model sum of squares (the portion that is explained by the model)
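The decomposition above can be verified directly in base R; a minimal sketch using the built-in mtcars data as a stand-in for the course's impact data:

```r
# Total, model, and residual sums of squares for a simple regression
fit <- lm(mpg ~ wt, data = mtcars)

ss_total <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation in Y
ss_model <- sum((fitted(fit) - mean(mtcars$mpg))^2)  # portion explained by the model
ss_resid <- sum(resid(fit)^2)                        # portion left unexplained

# For OLS with an intercept, ss_total = ss_model + ss_resid,
# and R-squared is the explained share of the total
all.equal(ss_total, ss_model + ss_resid)
all.equal(summary(fit)$r.squared, ss_model / ss_total)
```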

Estimation of standardized and unstandardized regression coefficients:

  • For simple regression, B1 = r * (sdY / sdX)
    • If X and Y have been standardized, then B1 = r
  • Executing a standardized linear regression in R is very similar to executing an unstandardized linear regression but involves the extra step of standardizing the variables by using the scale() function
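Both identities can be checked numerically; a minimal sketch, again using mtcars as a stand-in dataset:

```r
# With standardized variables, the regression slope equals the correlation
b1_z <- coef(lm(scale(mpg) ~ scale(wt), data = mtcars))[2]
r    <- cor(mtcars$mpg, mtcars$wt)
all.equal(unname(b1_z), r)

# The unstandardized slope is recovered as B1 = r * (sdY / sdX)
b1 <- coef(lm(mpg ~ wt, data = mtcars))[2]
all.equal(unname(b1), r * sd(mtcars$mpg) / sd(mtcars$wt))
```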

Assumptions of linear regression:

  • Normal distribution for Y
  • Linear relationship between X and Y
  • Homoscedasticity (constant variance)
  • Reliable/valid/representative measures for both X and Y
  • General process for assessing these is to examine the residuals

Anscombe’s quartet:

  • Four highly distinct data sets with identical correlations and identical sd(x), sd(y)
    • All four have the same regression line, Y = 3 + 0.5 * X
  • Calculate the residuals and plot them against the x variable(s)
    • If there is any trend or pattern, then the assumptions for the linear regression have been violated
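R ships Anscombe's quartet as the built-in `anscombe` data frame (columns x1-x4 and y1-y4), so the identical summary statistics are easy to verify:

```r
# Correlation and fitted coefficients for each of the four Anscombe data sets
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(cor = cor(x, y), coef(lm(y ~ x)))
})
# Each column shows cor of about 0.82, intercept of about 3.0, slope of about 0.5,
# despite the four scatterplots looking completely different
round(quartet_stats, 3)
```

Plotting resid(lm(y ~ x)) against x for each set reveals the patterns that the identical summary statistics hide.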

Example code includes:

# Create a linear regression with `ic2` and `vismem2` as regressors
model_1 <- lm(impact$sym2 ~ impact$ic2 + impact$vismem2)

# Extract the predicted values
predicted_1 <- fitted(model_1)

# Calculate the squared deviation of the predicted values from the observed values
deviation_1 <- (impact$sym2 - predicted_1) ** 2

# Sum the squared deviations
SSR_1 <- sum(deviation_1)
SSR_1
## [1] 7236.338
# Create a linear regression with `ic2` and `vermem2` as regressors
model_2 <- lm(impact$sym2 ~ impact$ic2 + impact$vermem2)

# Extract the predicted values
predicted_2 <- fitted(model_2)

# Calculate the squared deviation of the predicted values from the observed values
deviation_2 <- (impact$sym2 - predicted_2) ** 2

# Sum the squared deviations
SSR_2 <- sum(deviation_2)
SSR_2
## [1] 5435.454
# Create a standardized simple linear regression
model_1_z <- lm(scale(impact$sym2) ~ scale(impact$ic2))

#Look at the output of this regression model
summary(model_1_z)
## 
## Call:
## lm(formula = scale(impact$sym2) ~ scale(impact$ic2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4648 -0.5863 -0.3465  0.5958  1.9383 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       -4.487e-16  1.435e-01   0.000  1.00000   
## scale(impact$ic2) -4.438e-01  1.454e-01  -3.053  0.00413 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9078 on 38 degrees of freedom
## Multiple R-squared:  0.1969, Adjusted R-squared:  0.1758 
## F-statistic: 9.319 on 1 and 38 DF,  p-value: 0.004125
# Extract the R-Squared value for this regression
r_square_1 <- summary(model_1_z)$r.squared

#Calculate the correlation coefficient
corr_coef_1 <- sqrt(r_square_1)


# Create a standardized multiple linear regression
model_2_z <- lm(scale(impact$sym2) ~ scale(impact$ic2) + scale(impact$vismem2))

# Look at the output of this regression model
summary(model_2_z)
## 
## Call:
## lm(formula = scale(impact$sym2) ~ scale(impact$ic2) + scale(impact$vismem2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4349 -0.5949 -0.3174  0.5331  1.9646 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           -5.450e-16  1.443e-01   0.000   1.0000  
## scale(impact$ic2)     -4.101e-01  1.526e-01  -2.688   0.0107 *
## scale(impact$vismem2) -1.171e-01  1.526e-01  -0.767   0.4479  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9128 on 37 degrees of freedom
## Multiple R-squared:  0.2095, Adjusted R-squared:  0.1668 
## F-statistic: 4.904 on 2 and 37 DF,  p-value: 0.01291
# Extract the R-Squared value for this regression
r_square_2 <- summary(model_2_z)$r.squared

# Calculate the correlation coefficient
corr_coef_2 <- sqrt(r_square_2)


# Extract the residuals from the model
residual <- resid(model_2)

# Draw a histogram of the residuals
hist(residual)

# Extract the predicted symptom scores from the model
predicted <- fitted(model_2)

# Plot the residuals against the predicted symptom scores
plot(residual ~ predicted, main = "Scatterplot", xlab = "Model 2 Predicted Scores", ylab = "Model 2 Residuals")
abline(lm(residual ~ predicted), col = "red")


Inference for Linear Regression

Chapter 1 - Inferential Ideas

Variability in regression lines:

  • Different samples would produce different regression lines; question is the magnitude of the impact of sampling variability
  • Can take many bootstrap samples, calculate the regressions for each, and use broom::tidy() to get the various slopes

Research question - linear modeling for relationships between fat, carbohydrates, and calories in Starbucks food:

  • Can look at either a one-sided or two-sided test of relationships among any two of the variables
  • The standard error gives a sense for the uncertainty of the least-squares point estimate
  • The p-value reported by lm() is by default from a two-sided t-test on the intercept and coefficient estimates
    • Can divide the p-value by 2 for a one-sided test (which is only appropriate if the original research question was one-sided)

Variability of coefficients:

  • Using the RailTrail data from mosaicData, can look at bicycle volume as a function of high temperature
  • There is frequently variability in the intercept/coefficient estimates based on differences between the sample and the population
    • The “tighter” the data, generally the tighter the standard errors for the intercept/coefficient estimates
    • When there is less variability along the x-axis, generally the uncertainty in the slope increases

Example code includes:

# Load the mosaicData package and the RailTrail data
library(mosaicData)
data(RailTrail)

# Fit a linear model
ride_lm <- lm(volume ~ hightemp, data=RailTrail)

# View the summary of your model
summary(ride_lm)
## 
## Call:
## lm(formula = volume ~ hightemp, data = RailTrail)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -254.562  -57.800    8.737   57.352  314.035 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.079     59.395  -0.288    0.774    
## hightemp       5.702      0.848   6.724 1.71e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 104.2 on 88 degrees of freedom
## Multiple R-squared:  0.3394, Adjusted R-squared:  0.3319 
## F-statistic: 45.21 on 1 and 88 DF,  p-value: 1.705e-09
# Print the tidy model output
ride_lm %>% broom::tidy()
##          term   estimate  std.error  statistic      p.value
## 1 (Intercept) -17.079281 59.3953040 -0.2875527 7.743652e-01
## 2    hightemp   5.701878  0.8480074  6.7238541 1.705138e-09
expData1 <- c(-4.3, 0.19, -2.59, -0.43, 0.59, -2.74, 3.09, 3.51, 0.56, 5.89, 0.36, -0.01, 2.59, 1.51, 2.89, -8.26, -0.46, 3.28, 4.85, 1.16, 3.03, 2.24, 1.78, -0.26, 4.29, 6.92, -6.34, 0.49, 3.4, 3.08, 2.1, -1.93, 3.72, 0.52, -4.65, 4.24, -1.21, 5.15, -10.43, 6.46, -2.78, 0.7, 2.93, -4.84, -7.08, -3.98, 8.27, -4.51, -5.22, -2.17, 2.32, 0.37, -2.53, 3.2, -8.02, -1.82, -6.17, 1.45, -0.19, -0.91, -2.02, 1.13, 11.2, 4.43, 0.88, -0.28, -9.29, 0.18, -6.9, 0.44, -9.1, -1.21, 11.32, -3.3, 3.56, 1.28, 5.76, -2.73, -9.69, -4.43, 5.71, 1.09, -8.28, -7.12, -0.33, -4.3, 4.16, 4.83, -0.29, -3.78, 5.03, 12.3, 4.79, 0.69, -11.06, 3.73, -6.64, -0.24, 5.08, -0.48, 0.68, 4.43, 2.11, 1.8, 2.98, -4.84, -3.9, 4.1, 0.05, -7.43, -2.41, 1.14, -1.87, 11.12, 6.26, 1.29, -4.54, 5.38, 3.09, -4.59, 8.55, -4.21, -0.92, 0.79, -3.48, -6.13, 3.58, 4.54, -4.83, -13.5, 1.58, -1.03, 1.34, -1.46, 5.53, -4.23, -6.95, 6.17, -0.89, 9.95, -4.12, 0.08, 2.49, -8.42, -2.4, -6.96, 7.92, -5.04, -0.25, -0.63, 8.4, 4.18, -4.86, 0.99, -5.54, -4.23, -2.23, 2.21, -0.05, -2.67, -1.14, 3.3, -5.48, 3.86, 2.1, 4.81, -1.09, -10.97, -16.68, -8.58, 3.78, 5.94, 0.35, 0.14, -8.6, -3.44, -5.14, -6.65, -0.49, -1.99, 3.54, 4.7, -0.61, 8.69, 0.91, 0.71, 3.6, -3.1, -2.99, 5.82, 3.84, 0.82, -2.74, -6.27, -3.03, 1.29, 1.58, 1.76, 4.64, -7.24, 1.54, 0.83, -0.6, -0.29, 0.78, -8.42, 9.76, 14.35, -1.09, -13.42, -1.72, 4.49, -0.02, -0.47, 8.93, 5.27, -6.06, 12.66, 0.53, -3.08, 0.52, -0.71, -0.39, -1.11, -1.72, 8.66, -1.41, 2.77, 1.03, -6.97, 7.57, -10.75, -0.88, -2.53, 1.64, 6.48, -1.61, -1.98, -5.91, 7.25, -1.67, 4.26, -7.22, 6.03, 2.92, 4.08, 9.65, -12.34, 1.24, 3.76, 3.25, -9.13, -3.23, 0.51, -1.52, -3.44, 6.75, -0.18, -3.92, -4.14, 1.14, 6.44, -0.32, 5.91, -3.55, -8.99, -6.38, -2.64, -1.47, -3.91, 12.07, 5.55, -7.94, 10.98, -6.57, -3.43, -1.13, 9.51, 11.19, -3.21, -3.19, -7.94, 2.4)
expData2 <- c(-4.59, 6.5, 3.8, -6.42, 0.78, -2.4, -2.55, -3.2, -6.3, 4.69, -0.05, 5.71, 6.5, 3.69, -4.75, 4.87, -2.42, -5.04, 3.75, 1.69, -0.19, 8.33, 2.8, -0.09, 6.24, -3.73, -2.64, 8.11, -4.43, 4.42, 3.46, -6.71, -5.47, 6.84, 4.94, 2.23, 0.92, 1.56, -3.52, -5.42, -1.04, -4.33, -0.63, -1.72, -5.42, -8.92, -4.8, -6.53, 3.33, 3.39, 4.08, -3.03, -5.11, 7.04, -0.93, -2.56, -1.45, 8.75, -4.01, -5.87, 3.36, 5.83, 1.13, -1.25, -0.04, 0.23, 0.95, 3.16, -7.17, 12.37, -9.98, -9.73, 1.55, -8.56, 13.58, 0.56, 6.39, 2.34, -5.11, 6.48, -1.62, -1.16, -6.37, 7.48, 3.51, 4.82, 1.73, -0.48, -0.84, 2.58, -3.24, -1.33, 4.69, -0.99, 9.78, -16.75, -2.92, 10.15, -4.64, 5.66, 0.89, 2.11, 1.66, 3.78, 3.43, -1.09, -1.43, -10.07, -0.87, 4.41, -3.55, -1.66, 8.28, 8.3, 1.03, 6.42, -0.33, -2.63, -4.12, 6.68, -1.32, 10.69, 7.11, -3.75, 1.16, 5.19, -4.41, -4.13, -3.32, -8.24, -3.19, 1.1, 5.45, 2.19, -10.27, -0.87, -1.32, -2.77, 7.39, -14.48, -2.06, -3.46, -4.21, -6.55, -1.59, -0.44, -3.11, -4.21, -8.38, 0.01, 10.58, 3.05, 3.67, -2.52, 2.05, -2, 7.04, -0.42, -12.23, -0.44, -1.66, -1.31, -0.16, 1.72, -3.25, 2.56, -0.21, -1.59, 2.35, -2.5, 0.44, 8.61, 2.83, 10.75, 1.1, -0.89, 4.89, -0.91, 1.83, 3.2, -1.16, -3.23, 0.96, 2.59, 6.36, -0.53, -4.2, -1.13, 2.37, -1.06, -3.69, -0.25, 8.21, -5.84, -5.53, -3.03, -0.79, -0.72, -0.67, 3.23, -6.51, 2.06, -0.4, 0.75, -2.39, -2.27, -3.65, -7.56, 3.24, -4.05, -4.2, -5.91, 5.24, -11.65, -4.16, -5.99, 1.22, 1.32, -3.63, -0.9, -3.52, -5.25, 8.05, 4.09, -3.22, 5.71, 0.67, -5.46, -5.24, -1.7, -6.4, 0.48, 4.49, 15.97, -1.42, 2.41, -1.75, 4.77, -4.45, 0.88, 0.24, 11.64, -0.51, 1.58, 4.18, -3.51, 2.32, -2.15, -5.42, 5.6, 4.18, -4.82, -1.41, -5.32, 0.58, 1.23, -5.35, -5.88, 0.76, -2.81, 0.59, -2.26, 4.05, 0.32, 5.97, 4.22, -1.79, 3.28, -4.16, -4.88, -1.24, -7.38, -2.67, -4.56, 2.45, 4.92, 1.84, -1.6, 4.79, -4.02, -9.2, 6.78, -8.21, -0.18, -4.02, 4.84, 2.81, -2.65, -4.72, -0.83, -4.69, 7.94, 3.53, 4.25, 5.06, 7.88, -1.08, -0.78, 3.41, -10.45, 0.16, 0.13, -0.6, 1.82, 5.68, 
5.7, 4.66, -5.4, 7.12, -2.49, 1.5, 1.27, -8.26, 0.58, 0.04, 3.17, -3.23, -0.66, -3.2, 1.59, -4, -1.96, -3.48, -3.4, -3.95, 4.52, 2.5, -3.37, -14.81, -3.22, -3.57, 2.44, 0.17, 4.8, -6.15, -3.4, -4.1, -2.68, 5.86, 2.92, -0.19, -4.64, 9.4, 6.49, -5.84, -6.62, 2.86, -3.56, -4.6, -4.87, 7.32, 3.82, 8.99, -1.46, 4.98, -1.41, -5.89, -8.86, 6.87, -7.25, 2.67, 2.81, -0.22, -3.37, 6.74, 3.33, 4.72, 1.02, -3.02, -4.9, 2, 3.41, 0.5, -7.36, 6.36, 4, 2.24, -6.75, -5.62, -8.14, 1.82, 6.23, -0.18, 10.71, -0.57, 1.38, 9.5, 1.12, 3.08, 0.08, -4.75, 4.23, -2.23, -0.82, 1.84, -1.15, 4.12, -5.86, -0.16, -6.5, 4.86)
expData3 <- c(-0.62, 1.5, 5.44, -1.68, -10.04, 11.49, 1.48, -1.82, 1.57, 3.06, -2.36, -7.98, 0.25, -1.77, 3.32, 1.72, -7.55, 7.24, 2.78, -4.41, -2.55, -1.3, -1.49, 2.78, -4.37, -4.41, 0.57, 0.7, -0.56, 0.17, -2.52, 0.5, 2.46, -5.55, 2.98, 0.51, -0.28, 3.97, 6.74, 0.14, 3.54, 0.38, -2.69, 1.59, 3.09, -2.73, 4.93, 7.43, 1.76, 0.77, 4.54, 3.69, 5.75, -2.68, -1.01, 6.47, 1.91, -3.48, -2.91, 3.62, -3.72, 2.09, 0.63, -6.95, -0.66, -8.25, 6.6, -3.02, 3.51, 11.77, -1.78, -1.57, 5.58, -0.44, 3.07, -2.54, -3.1, 3.77, 8.05, -2.44, -0.95, 3.73, 1.64, 7.64, 3.63, 3.39, 1.71, -6.25, -3.47, 1.6, 3.49, 0.94, 0.18, -4.29, -2.62, 14.57, -1.73, 1.79, 2.54, -2.94, -0.56, 6.87, -4.81, 6.45, 4.2, 1.65, 8.4, 7.45, 7.11, 5.56, 1.06, -8.52, -7.68, -6.63, -4.09, 0.16, -6.08, -5.78, -4.46, -1.35, 3.34, -0.51, -3.65, -3.82, 0.64, 8.2, 14.07, -0.87, 3.3, 1.7, -3.17, -0.57, -1.06, 5.74, 0.79, -5.42, -2.22, 3.72, 2.88, -5.73, 0.82, -3.04, 6.11, 7.04, 2.84, 0.29, -2.37, 4.49, -5, -4.09, 0.33, 0.34, 0.81, 2.11, -1.55, -0.75, -7.49, 6.03, 0.14, 3.58, -0.67, 7.74, 5.55, 5.44, -8.21, 8.48, 2.15, 0.04, -3.68, 6.09, 4.06, 2.85, 2.47, -1.37, 3.66, 0.63, 0.46, -1.82, -7.6, 0.05, -3.03, -7.56, 1.56, 2.44, -2.56, -9.01, -0.19, -5.88, -7.51, -5.84, 3.79, -18, 5.33, -4.15, -6.26, 0.53, 15.21, 4.85, 1.98, -1.25, -1.12, -5.65, -0.96, 11.19, 2.76, -2.89, 0.49, -1.83, -2.52, -1.03, -1.54, -1.22, 4.27, -2.39, 0, -3.61, 0.93, -8.6, -4.41, 5.23, -3.77, 0.99, -6.99, -1.57, 3, -0.47, -3.44, 10.14, 2.8, 1.28, -0.16, 8, 4.47, 1.46, 0.86, -3.14, 1.47, 2.22, -0.05, -1.66, 3.6, -2.25, -5.84, 5.91, 2.39, 4.85, 5.07, -2.37, 0.86, -4.37, -3.32, 2.24, -3.78, 1.35, 0.01, -1.53, -3.88, 2.32, -4.27, -1.08, 4.45, 3.55, 1.82, 11.33, 1.49, -1.67, 0.49, -3.35, 0.26, -2.57, -2.51, -13.35, 6.11, 8.47, 3.94, -4.56, -3.28, 3.92, 5.81, -3.57, -1.75, -4.77, 4, -3.46, -2.25, -0.94, -4.16, -11.13, 5.81, -3.29, 5.69, 10.75, 4.29, -0.21, -0.38, 6.03, -1.97, -4.57, 7.61, -5.07, 3.82, 1.73, 8.15, 5.79, 0.19, 1.28, 3.23, -5.88, -10.91, 9.61, 
-0.47, 6.15, -6.18, -0.29, 1.76, 0.34)

respData1 <- c(27.8, 39.19, 39.31, 42.6, 46.38, 40.21, 44.47, 46.24, 34.38, 69.78, 47.47, 53.41, 52.07, 47.09, 49.82, 13.05, 37.2, 54.27, 42.01, 31.94, 46.56, 10.89, 34.58, 44.43, 51.34, 44.57, 23.28, 46.32, 38.84, 50.78, 34.92, 35.1, 59.31, 40.65, 26.79, 37.85, 45.41, 52.6, 19.58, 36.63, 17.9, 63.94, 51.59, 19.05, 14.15, 37.03, 66.97, 15.58, 29.71, 43.78, 47.02, 37.27, 30.03, 43.11, 36.41, 32.44, 42.13, 46.72, 39.5, 32.4, 45.52, 26.89, 72.18, 51.75, 46.9, 41.5, 22.07, 46.82, 17.87, 50.12, 18.44, 28.21, 68.83, 24.07, 49.43, 43.31, 53.94, 26.36, 7.55, 17.13, 64.75, 36.93, 8.65, 21.06, 44.15, 40.35, 27.84, 42.75, 38.86, 21.84, 52.34, 63.13, 43.23, 38.48, 25.56, 37.81, 19.7, 32.33, 51.69, 40.01, 35.01, 60.59, 47.98, 32.92, 62.64, 15.48, 28.79, 46.04, 60.79, 32.6, 55.21, 41.05, 33.99, 58.24, 50.12, 43.4, 38.2, 58.34, 40.5, 25.68, 60.69, 44.46, 25.28, 40.56, 42.48, 32.15, 52.42, 56.78, 31.09, 18.29, 53.15, 30.62, 43.09, 35.78, 56.31, 20.42, 23.26, 48.99, 26.23, 61, 41.67, 41.04, 11.61, 41.64, 50.24, 18.98, 48.7, 17.97, 38, 50.85, 63.39, 57.49, 19.51, 54.11, 18.01, 33.74, 19.89, 44.66, 23.09, 42.45, 47.84, 39.38, 26.44, 25.24, 46.74, 33.03, 35.28, 35.73, -2.21, 20.49, 54.54, 50.42, 34.82, 47.67, 13.75, 44.62, 33.73, 31.53, 42.63, 36.64, 55.48, 49.84, 41.98, 69.24, 48.39, 39.12, 40.55, 41.95, 29.31, 34.22, 32.13, 33.6, 14.66, 23.75, 31.9, 35.76, 29, 50.02, 51.85, 13, 43.69, 45.67, 39.06, 43.92, 47.04, 11.32, 66.35, 56.47, 46.27, -0.56, 58.99, 57.85, 50.48, 24.43, 62.28, 49.07, 29.16, 63.71, 35.43, 25.9, 27.7, 40.02, 36.33, 43.11, 28.98, 51.88, 45.73, 48.29, 44.55, 28.76, 60.29, 14.26, 38.09, 43.13, 47.68, 60.55, 47.78, 29.85, 26.43, 62.71, 31.78, 38.87, 28.99, 56.19, 17.08, 44.7, 51.59, 9.56, 39.35, 48.91, 35.22, 26.53, 36.73, 43.78, 50.2, 32.55, 53.92, 33.67, 32.58, 34.44, 41.82, 51.16, 21.73, 53.09, 35.07, 6.84, 30.26, 33.74, 54.12, 32.41, 57.36, 52.16, 22, 64.32, 42.23, 51.91, 44.38, 47.45, 57.47, 48.04, 28.02, 21.08, 52.87, 30.48, 52.76, 51.07, 21.07, 37.38, 27.2, 
35.8, 35.36, 50.08, 61.72, 27.94, 54.76, 57.88, 50.57, 38.38, 58.27, 13.5, 26.88, 33.78, 67.2, 31.6, 43.14, 43.19, 43.39, 43.09, 30.85, 34.1, 72.42, 15.42, 66.27, 43.71, 28.42, 13.3, 46.63, 35.42, 52.04, 55.89, 49.6, 44.02, 24.67, 46.79, 37.98, 39.42, 23.08, 26.36, 30.27, 57.94, 22.79, 60.07, 36.51, 49.22, 22.53, 29.96, 62.01, 42.3, 30.05, 55.57, 51.68, 23.05, 28.9, 47.02, 63.76, 42.48, 56.21, 44.29, 28.75, 43.53, 29.18, 14.44, 67.68, 16.92, -1, 51.54, 24.68, 77.5, 72.04, 51.44, 59.89, 5.48, 65.2, 35.03, 29.17, 25.99, 65.76, 54.67, 51.44, 51.34, 25.71, 25.48, 45.49, 37.31, 25.55, 59.4, 23.38, 46.47, 5.27, 48.51, 59.98, 34.85, 48.4, 62.56, 27.1, 41.17, 60.38, 57.21, 17.7, 39.84, 9.25, 39.82, 60.86, 53.29, 33.74, 66.61, 66.06, 50.5, 67.98, 21.79, 25.02, 48.24, 69.45, 35.39, 67.24, 53.73, 25.21, 43.4, 50.39, 30.88, 44.33, 6.28)
respData2 <- c(18.26, 36.11, 25.88, 46.64, 38.22, -2.89, 24.17, 32.85, 37.09, 52.34, 12.35, 50.37, 31.17, 22.07, 42.49, 51.39, 34.57, 25.83, 28.45, 37.21, 41.36, 71.02, 46.11, 39.2, 36.16, 41.46, 40.83, 59.73, 30.75, 1.55, 26.67, 49.85, 35.61, 50.58, 39.23, 40.37, 45.63, 35.29, 46.06, 44.51, 36.47, 51.52, 46.69, 36.55, 53.82, 66.62, 47.33, 54.24, 27.13, 45.48, 53.06, 21.14, 52.55, 51.62, 47.59, 40.92, 27.66, 34.75, 30.73, 61.73, 36.3, 48.03, 34.31, 61.52, 27.52, 32.32, 45.33, 31.56, 32.85, 33.09, 44.52, 15.36, 48.63, 37.8, 45.67, 42.7, 40.2, 13.56, 29.57, 48.77, 29.23, 34.8, 26.91, 50.36, 29.04, 27.91, 0.98, 44.37, 29.11, 36.25, 37.73, 36.64, 18.95, 41.73, 54.34, 34.53, 56.36, 44.06, 39.32, 21.44, 53.32, 27.81, 49.95, 49.67, 68.74, 31.31, 57.06, 33.94, 39.64, 7.46, 52.8, 34.4, 70.42, 32.35, 49.45, 47.13, 34.96, 42.26, 38.28, 28.98, 37.15, 49.47, 29.23, 31.53, 28.17, 35.08, 32.34, 33.13, 33.98, 31.66, 29.63, 38.07, 49.42, 48.03, 45.81, 42.76, 63.01, 31.79, 36.26, 35.28, 34.19, 24.96, 9.13, 30.69, 36.83, 22.96, 52.03, 52.85, 49.96, 54.53, 31.88, 14.11, 51.02, 28.36, 40.92, 53.8, 63.55, 49.42, 16.49, 26.25, 34.56, 34.24, 29.5, 56.65, 33.47, 57.91, 54.78, 40.52, 41.14, 43.87, 28.43, 25.15, 38.2, 52.35, 40.83, 58.92, 37.48, 50.09, 33.76, 46.91, 30.51, 52.1, 45.28, 25.65, 28.95, 43.69, 49.32, 32.96, 34.64, 45.21, 30.77, 43.83, 45.89, 27.21, 38.51, 23.67, 37.26, 49.04, 42.06, 7.7, 36.93, 20.52, 29.9, 30.13, 55.83, 18.76, 35.06, 36.68, 34.92, 59.64, 41.81, 22.45, 28.44, 77.59, 59.44, 19.26, 34.14, 63.37, 24.93, 24.94, 16.79, 38.96, 34.77, 55, 43.88, 43.47, 29, 38.99, 20.2, 59.86, 42.71, 52.67, 47.27, 42.85, 15.67, 54, 68.67)
respData3 <- c(52.15, 49.27, 49.14, 40.6, 50.46, 58.2, 45.66, 20.46, 49.97, 42.83, 25.73, 28.75, 27.69, 22.29, 53.53, 52.18, 41.86, 69.86, 37.78, 31.75, 46.9, 32.98, 52.79, 35.54, 34.51, 26.63, 39.69, 27, 47.79, 19.5, 45.63, 26.92, 44.14, 16.93, 42.78, 56.02, 47.01, 46.89, 19.48, 25.95, 61.74, 47.83, 45.1, 41.11, 64.28, 27.7, 8.05, 49.37, 43.05, 49.23, 33.99, 25.97, 60.66, 44.42, 37.06, 40.95, 21.97, 18.88, 34.68, 41.47, 35.65, 36.49, 31.45, 36.02, 38.67, 28.87, 47.33, 48.99, 39.26, 59.34, 57.07, 39.02, 35.12, 50.94, 48.7, 59.07, 36.1, 41.1, 48.53, 36.84, 20.57, 67.63, 52.86, 28.83, 47.47, 43.59, 63.2, 59.74, 23.63, 44.15, 49.71, 41.84, 28.01, 24.4, 36.04, 46.16, 39.15, 50.76, 27.61, 11.67, 27.04, 39.95, 19.58, 31.08, 70.48, 41.18, 21.73, 51.49, 45.3, 61.05, 18.11, 26.11, 41.49, 51.15, 45.32, 35.03, 45.41, 48.5, 41.7, 56.59, 44.21, 51.48, 22.45, 38.89, 42.35, 47.44, 44.4, 47.43, 38.98, 15.87, 73.93, 57.64, 55.71, 56.87, 40.79, 35.31, 56.81, 31.64, 51.35, 40.61, 44.84, 63.09, 56.03, 36.39, 54.85, 35.81, 5.16, 25.41, 26.6, 39.91, 18.4, 30.47, 28.81, 35.49, 21.98, 54.58, 37.43, 45.14, 26.94, 42.56, 61.86, 63.86, 22.83, 41.93, 43.3, 35.72, 30.3, 23.57, 62.44, 28.81, 45.11, 38.81, 37.09, 35.89, 28.37, 25.34, 20.95, 37.1, 50.87, 44.03, 25.36, 46.32, 40.63, 13.77, 33.89, 35.41, 42.71, 57.71, 46.37, 48.02, 42.58, 20.81, 64.81, 42.34, 44.93, 24.07, 71.44, 49.04, 65.73, 30.5, 56.97, 77.54, 37.53, 49.04, 52.37, 44.21, 49.64, 41.62, 33.15, 26.12, 59.01, 50.39, 12.17, 26.18, 51.92, 35.1, 45.64, 62.54, 47.18, 23.61, 13.8, 48.7, 22.64, 23.25, 26.07, 65.36, -3.6, 44.56, 20.4, 28.77, 44.92, 52.98, 38.6, 46.51, 37.8, 41.54, 14.45, 40.69, 61.11, 55.09, 30.34, 39.57, 32.23, 48.52, 44.47, 27.24, 40.09, 48.87, 31.78, 44.45, 50.95, 36.98, 24.06, 22, 39.89, 33.79, 52.91, 18.36, 32.73, 65.36, 39.55, 36.56, 61.26, 70.61, 48.07, 41.77, 83.77, 62.77, 36.37, 38.26, 31.61, 46.1, 58.54, 24.53, 39.71, 58.49, 36, 28.21, 41.23, 31.57, 31.77, 42.54, 22.47, 48.5, 46.3, 30.97, 53.55, 23.35, 60.6, 
41.07, 46.19, 22.14, 52.2, 29.76, 42.34, 43.52, 38.48, 49.56, 59.15, 29.13, 20.9, 40.13, 25.61, 56.45, 35.77, 34.6, 27.61, 37.08, 42.26, 36.76, 23.48, 27.94, 43.68, 49.03, 54.34, 57.83, 45.74, 54, 41.25, 36.96, 56.99, 25.3, 37.1, 54.32, 39.65, 59.93, 53.08, 32.52, 25.58, 34.33, 50.6, 49.97, 23.38, 66.08, 39.08, 47.11, 54.24, 55.54, 45.4, 44.52, 36.92, 45.72, 29.69, 13.78, 41.59, 27.96, 40.98, 19.86, 37.85, 23.43, 41.28)

popdata <- data.frame(explanatory=c(expData1, expData2, expData3), 
                      response=c(respData1, respData2, respData3)
                      )
str(popdata)
## 'data.frame':    1000 obs. of  2 variables:
##  $ explanatory: num  -4.3 0.19 -2.59 -0.43 0.59 -2.74 3.09 3.51 0.56 5.89 ...
##  $ response   : num  27.8 39.2 39.3 42.6 46.4 ...
# Plot the whole dataset
ggplot(popdata, aes(x = explanatory, y = response)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) 

# Take 2 samples of size 50
set.seed(4747)
sample1 <- popdata %>% sample_n(50)
sample2 <- popdata %>% sample_n(50)

# Plot sample1
plot1 <- ggplot(sample1, aes(x = explanatory, y = response)) + 
  geom_point(color = "blue") + 
  geom_smooth(method = "lm", se = FALSE, color = "blue")

plot1 

# Plot sample2 over sample1
plot1 + geom_point(data = sample2, 
                   aes(x = explanatory, y = response),
                   color = "red") +
  geom_smooth(data = sample2, 
              aes(x = explanatory, y = response), 
              method = "lm", 
              se = FALSE, 
              color = "red")

# Repeatedly sample the population
manysamples <- infer::rep_sample_n(popdata, size=50, reps=100)

# Plot the regression lines
ggplot(manysamples, aes(x=explanatory, y=response, group=replicate)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) 

# Fit and tidy many linear models
manylms <- manysamples %>% 
  group_by(replicate) %>% 
  do(lm(response ~ explanatory, data=.)  %>% 
     broom::tidy()) %>%
  filter(term=="explanatory")

# Plot a histogram of the slope coefficients
ggplot(manylms, aes(x=estimate)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Take 100 samples of size 50
manysamples1 <- infer::rep_sample_n(popdata, size=50, reps=100)

# Plot the regression line for each sample
ggplot(manysamples1, aes(x=explanatory, y=response, group=replicate)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) 

# Take 100 samples of size 10
manysamples2 <- infer::rep_sample_n(popdata, size=10, reps=100)

# Plot the regression line for each sample
ggplot(manysamples2, aes(x=explanatory, y=response, group=replicate)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) 

# In order to understand the sampling distribution associated with the slope
# coefficient, it is valuable to visualize the impact that changes in the
# sample and population have on the slope coefficient. Here, reducing the
# variance of the response variable around the line changes the variability
# associated with the slope statistics.
# The new popdata is already loaded in your workspace.
# Take 100 samples of size 50
oldPopData <- popdata

popdata$response <- (oldPopData$response - mean(oldPopData$response)) / sd(oldPopData$response)
popdata$response <- 40 + popdata$response * 11.152

manysamples <- infer::rep_sample_n(popdata, size=50, reps=100)

# Plot a regression line for each sample
ggplot(manysamples, aes(x=explanatory, y=response, group=replicate)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) 


Chapter 2 - Simulation Based Inference for Slope Parameters

Simulation-based inference - using the twins study from the 1920s (one twin was raised by their parents and the other in a foster home):

  • The regression analysis looks at Foster vs Biological on an IQ test
  • Can instead permute the response variable (Foster) and see what baseline variability in slopes would be
    • Can then assess the likelihood of seeing the actual regression slope in the data; that becomes the p-value
  • Can run the analyses using the “infer” package; specifically
    • twins %>%
    • specify(Foster ~ Biological) %>%
    • hypothesize(null = "independence") %>%
    • generate(reps = 10, type = "permute") %>%
    • calculate(stat = "slope")
  • Typically, to do inference, you will need to know the sampling distribution of the slope under the hypothesis that there is no relationship between the explanatory and response variables
    • In most situations, you don’t know the population from which the data came, so the null sampling distribution must be derived from only the original dataset
    • In this exercise you’ll use the pull() function. This function takes a data frame and returns a selected column as a vector (similar to $)

Simulation-based inference for slope - can also be calculated using bootstrap for CI (as opposed to testing a null-hypothesis):

  • The bootstrap will include some (x, y) pairs two or more times and others not at all; unlike the permutation approach, the pairing of x and y is preserved
  • While the permuted slopes were centered around zero (consistent with the null hypothesis), the resampled slopes will be centered around the observed slope (appropriate for building confidence intervals)
  • Can run the analyses using the “infer” package; specifically
    • twins %>%
    • specify(Foster ~ Biological) %>%
    • generate(reps = 100, type = "bootstrap") %>%
    • calculate(stat = "slope")

Example code includes:

# Load the infer package
library(infer)

twins <- readr::read_csv("./RInputFiles/twins.csv")
## Parsed with column specification:
## cols(
##   Foster = col_integer(),
##   Biological = col_integer(),
##   Social = col_character()
## )
str(twins)
## Classes 'tbl_df', 'tbl' and 'data.frame':    27 obs. of  3 variables:
##  $ Foster    : int  82 80 88 108 116 117 132 71 75 93 ...
##  $ Biological: int  82 90 91 115 115 129 131 78 79 82 ...
##  $ Social    : chr  "high" "high" "high" "high" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 3
##   .. ..$ Foster    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Biological: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Social    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
# Calculate the observed slope
obs_slope <- lm(Foster ~ Biological, data=twins) %>%
  broom::tidy() %>%   
  filter(term == "Biological") %>%
  pull(estimate)

# Simulate 10 slopes with a permuted dataset
set.seed(4747)
perm_slope <- twins %>%
  specify(Foster ~ Biological) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 10, type = "permute") %>%
  calculate(stat = "slope") 

# Print the observed slope and the 10 permuted slopes
obs_slope
## [1] 0.901436
perm_slope
## # A tibble: 10 x 2
##    replicate    stat
##        <int>   <dbl>
##  1         1  0.143 
##  2         2  0.0710
##  3         3 -0.456 
##  4         4  0.0749
##  5         5  0.297 
##  6         6  0.0673
##  7         7  0.140 
##  8         8  0.164 
##  9         9  0.0971
## 10        10  0.184
# Make a dataframe with replicates and plot them!
set.seed(4747)
perm_slope <- twins %>%
  specify(Foster ~ Biological) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 500, type = "permute") %>%
  calculate(stat = "slope") 

ggplot(perm_slope, aes(x=stat)) +
  geom_density()

# Calculate the mean and the standard deviation of the slopes
mean(perm_slope$stat)
## [1] 0.006285095
sd(perm_slope$stat)
## [1] 0.1963073
# Calculate the absolute value of the slope
abs_obs_slope <- lm(Foster ~ Biological, data=twins) %>%
  broom::tidy() %>%   
  filter(term == "Biological") %>%
  pull(estimate) %>%
  abs()

# Compute the p-value  
perm_slope %>% 
  mutate(abs_perm_slope=abs(stat)) %>%
  summarize(p_value = mean(abs_perm_slope > abs_obs_slope))
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0
# Calculate 1000 bootstrapped slopes
set.seed(4747)
BS_slope <- twins %>%
  specify(Foster ~ Biological) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope") 

# Look at the head of BS_slope  
head(BS_slope)
## # A tibble: 6 x 2
##   replicate  stat
##       <int> <dbl>
## 1         1 0.946
## 2         2 0.966
## 3         3 0.870
## 4         4 0.930
## 5         5 0.807
## 6         6 0.900
# Create a confidence interval
BS_slope %>% 
summarize(lower = mean(stat) - 2 *sd(stat),
          upper = mean(stat) + 2 *sd(stat))
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 0.719  1.08
# Set alpha
alpha <- 0.05

# Create a confidence interval  
BS_slope %>% 
summarize(low = quantile(stat, alpha/2), 
          high = quantile(stat, 1 - alpha/2))
## # A tibble: 1 x 2
##     low  high
##   <dbl> <dbl>
## 1 0.724  1.08

Chapter 3 - t-Based Inference for the Slope Parameter

Mathematical approximation for testing and estimating slope parameters (based on the t-distribution):

  • Continuing with the Starbucks data for the relationships between fat and calories
  • Can compare the histogram of (1) the simulated slopes under the permuted null hypothesis with (2) the t-distribution with df = n - 2; the two should overlap very closely
    • The dt() function gives the density of the t-distribution; R uses this distribution behind the scenes to compute the p-value
  • Sometimes, the t-distribution is not such a good match to the histograms of the null hypothesis
    • R will use the t-distribution even if it is not a perfect fit; further reinforcement of the value of plotting the data
    • The differences between the mathematical model and the data are only really important in the edge cases (minor, low-power impact)
  • In thinking about the scientific research question, if IQ is caused only by genetics, then we would expect the slope of the line between the two sets of twins to be 1
    • Testing the hypothesized slope value of 1 can be done by making a new test statistic which evaluates how far the observed slope is from the hypothesized value of 1
    • new_t = (slope - 1) / SE
    • If the hypothesis that the slope equals one is true, then the new test statistic will have a t-distribution which we can use for calculating a p-value
  • When technical conditions (see next chapter) hold, the inference from the randomization test and the t-distribution test should give equivalent conclusions
    • They will not provide the exact same answer because they are based on different methods
    • But they should give p-values and confidence intervals that are reasonably close

Intervals in regression - estimating the coefficients by way of confidence intervals (CI):

  • CI = point-estimate +/- critical_value * SE_estimate
  • Can use the tidy() call to get the CI, such as tidy(lm(…), conf.int=TRUE, conf.level=1-alpha)
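The CI formula above can be reproduced by hand in base R; a minimal sketch using mtcars as a stand-in dataset, where the manual interval matches confint() (and broom::tidy(fit, conf.int = TRUE) would report the same values):

```r
# CI = point estimate +/- critical value * SE of the estimate
fit <- lm(mpg ~ wt, data = mtcars)
alpha <- 0.05

est <- summary(fit)$coefficients["wt", "Estimate"]
se  <- summary(fit)$coefficients["wt", "Std. Error"]
crit_val <- qt(1 - alpha / 2, df = df.residual(fit))  # t critical value

manual_ci <- c(est - crit_val * se, est + crit_val * se)

# Agrees exactly with R's built-in interval
all.equal(unname(manual_ci), unname(confint(fit, "wt", level = 1 - alpha)[1, ]))
```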

Different types of intervals - often of interest to know the variability in the predicted value, not just the parameter estimates:

  • There is greater uncertainty in predicted values as you get further away from the “center of mass” of the x-data
  • Can use the broom::augment() call to get the range of predicted values for a new set of data
    • alpha <- .05
    • crit_val <- qt((1-alpha/2), df = nrow(starbucks) - 2)
    • newfood <- data.frame(Fat = c(0,10,20,30))
    • augment(lm(Calories ~ Fat, data=starbucks), newdata = newfood) %>%
    • mutate(lowMean = .fitted - crit_val * .se.fit, upMean = .fitted + crit_val * .se.fit)
  • Can create plots for the population estimate error using ggplot
    • ggplot(predMeans, aes(x = Fat, y = Calories)) +
    • geom_point() +
    • stat_smooth(method = "lm", se = FALSE) + # can alternately set se = TRUE to get a similar plot
    • geom_ribbon(aes(ymin = lowMean, ymax = upMean), alpha = .2)
  • A prediction interval can also be requested to give the interval for an individual member of the population (rather than the population mean) at a specific X value
    • This combines both the error in the population-mean prediction AND the natural variability of individual y-values in the population
    • FatCal_pred <- augment(FatCal_lm) %>%
    • mutate(.se.pred = sqrt(FatCal_sig^2 + .se.fit^2)) # FatCal_sig is the residual sigma from broom::glance()
    • predResp <- FatCal_pred %>%
    • mutate(lowResp = .fitted - crit_val * .se.pred, upResp = .fitted + crit_val * .se.pred)
    • ggplot(predResp, aes(x = Fat, y = Calories)) +
    • geom_point() +
    • stat_smooth(method = "lm", se = FALSE) +
    • geom_ribbon(aes(ymin = lowResp, ymax = upResp), alpha = .2)
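As a cross-check on the manual augment() approach above, base R's predict() can produce both interval types directly. A minimal sketch, again assuming the built-in cars dataset as stand-in data:

```r
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = c(5, 15, 25))

# Interval for the population mean response
# (matches .fitted +/- crit_val * .se.fit)
predict(fit, newdata = new, interval = "confidence", level = 0.95)

# Interval for an individual response; wider because it also folds in
# sigma^2, the residual variability (matches sqrt(sigma^2 + .se.fit^2))
predict(fit, newdata = new, interval = "prediction", level = 0.95)
```

At every X the prediction interval strictly contains the confidence interval, mirroring the red/grey ribbon plot built above.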

Example code includes:

# twins_perm <- twins %>%
#   specify(Foster ~ Biological) %>%
#   hypothesize(null="independence") %>%
#   generate(reps = 10, type = "permute") %>%
#   calculate(stat = "slope") 

# The randomized slopes are given in the twins_perm dataframe
# Look at the head of the data
# head(twins_perm)

# Plot the histogram with the t distribution
# twins_perm %>%
#   filter(term == "Biological_perm") %>%
#   ggplot(aes(x=statistic)) + 
#   geom_histogram(aes(y = ..density..), bins = 50) + 
#   stat_function(fun = dt, color = "red", args = list(df=nrow(twins)-2))


# Tidy the model
lm(Foster ~ Biological, data=twins) %>% broom::tidy()
##          term estimate  std.error statistic      p.value
## 1 (Intercept) 9.207599 9.29989643 0.9900754 3.316237e-01
## 2  Biological 0.901436 0.09633286 9.3575128 1.203600e-09
# Create a one-sided p-value
lm(Foster ~ Biological, data=twins) %>%
  broom::tidy() %>% 
  filter(term == "Biological") %>%
  select(p.value) %>%
  mutate(p_value_1side = p.value/2)
##      p.value p_value_1side
## 1 1.2036e-09     6.018e-10
# Test the new hypothesis
lm(Foster ~ Biological, data = twins) %>% 
  broom::tidy() %>% 
  filter(term == "Biological") %>%
  mutate(statistic_test1 = (estimate - 1) / std.error, 
      p_value_test1 = 2 * pt(abs(statistic_test1), df=nrow(twins)-2, lower.tail=FALSE))
##         term estimate  std.error statistic    p.value statistic_test1
## 1 Biological 0.901436 0.09633286  9.357513 1.2036e-09       -1.023161
##   p_value_test1
## 1     0.3160311
# Find the p-value
# perm_slope %>%
#   mutate(abs_perm_slope = abs(stat)) %>%
#   summarize(p_value = mean(abs_perm_slope > abs(obs_slope)))


# Set alpha
alpha <- 0.05

# Find the critical value
crit_val <- qt(0.975, df = nrow(twins)-2)

# Tidy the model with the confidence level alpha
lm(Foster ~ Biological, data=twins) %>% 
   broom::tidy(conf.int=TRUE, conf.level=1-alpha)
##          term estimate  std.error statistic      p.value   conf.low
## 1 (Intercept) 9.207599 9.29989643 0.9900754 3.316237e-01 -9.9458964
## 2  Biological 0.901436 0.09633286 9.3575128 1.203600e-09  0.7030348
##   conf.high
## 1 28.361094
## 2  1.099837
# Find the lower and upper bounds of the confidence interval
lm(Foster ~ Biological, data=twins) %>%
    broom::tidy() %>%
    mutate(lower = estimate - crit_val * std.error,
           upper = estimate + crit_val * std.error)
##          term estimate  std.error statistic      p.value      lower
## 1 (Intercept) 9.207599 9.29989643 0.9900754 3.316237e-01 -9.9458964
## 2  Biological 0.901436 0.09633286 9.3575128 1.203600e-09  0.7030348
##       upper
## 1 28.361094
## 2  1.099837
# Create the bootstrap confidence interval
BS_slope %>% 
    summarize(low = quantile(stat, alpha/2), 
              high = quantile(stat, 1 - alpha/2))
## # A tibble: 1 x 2
##     low  high
##   <dbl> <dbl>
## 1 0.724  1.08
# Set alpha
alpha <- 0.05

# Find the critical value
crit_val <- qt(1-alpha/2, nrow(twins)-2)


# Create a dataframe of new observations
newtwins <- data.frame(Biological = c(80, 90, 100, 110))

# Find prediction intervals
lm(Foster ~ Biological, data=twins) %>% 
  broom::augment(newdata = newtwins) %>%
  mutate(lowMean = .fitted - crit_val * .se.fit,
         upMean = .fitted + crit_val * .se.fit)
##   Biological   .fitted  .se.fit   lowMean    upMean
## 1         80  81.32248 2.093789  77.01024  85.63472
## 2         90  90.33684 1.572563  87.09809  93.57559
## 3        100  99.35120 1.554979  96.14866 102.55374
## 4        110 108.36556 2.054014 104.13524 112.59588
# Set alpha and find the critical value
alpha <- 0.05
crit_val <- qt(1-alpha/2, df=nrow(twins)-2)

# Find confidence intervals for the response
predMeans <- lm(Foster ~ Biological, data=twins) %>%
  broom::augment() %>%  
  mutate(lowMean = .fitted - crit_val*.se.fit,
      upMean = .fitted + crit_val*.se.fit) 

# Examine the intervals
head(predMeans)
##   Foster Biological   .fitted  .se.fit     .resid       .hat   .sigma
## 1     82         82  83.12535 1.962975  -1.125351 0.06449924 7.885059
## 2     80         90  90.33684 1.572563 -10.336839 0.04139435 7.588558
## 3     88         91  91.23827 1.543999  -3.238275 0.03990427 7.859737
## 4    108        115 112.87274 2.411531  -4.872739 0.09734432 7.818859
## 5    116        115 112.87274 2.411531   3.127261 0.09734432 7.859970
## 6    117        129 125.49284 3.571300  -8.492843 0.21349013 7.642607
##        .cooksd .std.resid   lowMean    upMean
## 1 0.0007811552 -0.1505319  79.08253  87.16817
## 2 0.0402839133 -1.3659358  87.09809  93.57559
## 3 0.0037993804 -0.4275816  88.05835  94.41820
## 4 0.0237414774 -0.6635515 107.90610 117.83938
## 5 0.0097788932  0.4258588 107.90610 117.83938
## 6 0.2083390637 -1.2389779 118.13761 132.84807
# Plot the data with geom_ribbon()
ggplot(predMeans, aes(x=Biological, y=Foster)) + 
  geom_point() +
  stat_smooth(method="lm", se=FALSE) + 
  geom_ribbon(aes(ymin = lowMean, ymax = upMean), alpha=.2)

# Plot the data with stat_smooth()
ggplot(twins, aes(x = Biological, y = Foster)) + 
  geom_point() +
  stat_smooth(method="lm", se=TRUE) 

# Set alpha and find the critical value
alpha <- 0.05
crit_val <- qt(1-alpha/2, nrow(twins)-2)

# Fit a model and use glance to find sigma
twin_lm <- lm(Foster ~ Biological, data=twins)
twin_gl <- broom::glance(twin_lm)

# Pull sigma
twin_sig <- pull(twin_gl, sigma)

# Augment the model to find the prediction standard errors
twin_pred <- broom::augment(twin_lm) %>%
  mutate(.se.pred = sqrt(twin_sig ** 2 + .se.fit ** 2))

# Create prediction intervals  
predResp <- twin_pred %>%
  mutate(lowResp = .fitted - crit_val * .se.pred,
      upResp = .fitted + crit_val * .se.pred)

# Plot the intervals using geom_ribbon()
ggplot(predResp, aes(x=Biological, y=Foster)) + 
  geom_point() +
  stat_smooth(method="lm", se=FALSE) + 
  geom_ribbon(aes(ymin = lowResp, ymax = upResp), alpha = .2) +
  geom_ribbon(data = predMeans, aes(ymin = lowMean, ymax = upMean), alpha = .2, fill = "red")


Chapter 4 - Technical Conditions in Linear Regression